
Learn Hive in Nine Chapters - Hive Syntax

Author: 小牛君 | Published: 2017-06-16



1. Hive Syntax

5.0 Databases

DATABASE and SCHEMA mean the same thing in Hive.

Create a database:

CREATE (DATABASE|SCHEMA) [IF NOT EXISTS] database_name

  [COMMENT database_comment]

  [LOCATION hdfs_path]

  [WITH DBPROPERTIES (property_name=property_value, ...)];

Example:

create database financials

comment 'test'

location '/hdfs/financials'

with dbproperties('date'='2017-03-31');

Show a database's description:

desc database financials;

desc database extended financials;

List databases:

show databases;

show databases like 't.*';

Drop a database:

DROP (DATABASE|SCHEMA) [IF EXISTS] database_name [RESTRICT|CASCADE];

Example:

drop database financials;

drop database if exists financials;

Drop the tables in the database first, then the database itself (CASCADE):

drop database test cascade;

drop database if exists financials cascade;

Alter database properties:

ALTER (DATABASE|SCHEMA) database_name SET DBPROPERTIES (property_name=property_value, ...);   -- (Note: SCHEMA added in Hive 0.14.0)

Example:

alter database financials set dbproperties('date'='today');

 

ALTER (DATABASE|SCHEMA) database_name SET OWNER [USER|ROLE] user_or_role;
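
Change a database's owner, for example (a minimal sketch; the user name hadoop is hypothetical):

alter database financials set owner user hadoop;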

 

 

 

 

 

 

 

List all tables in the current database:

show tables;

List all tables in a given database:

show tables in test;

 

 

 

View table structure:

desc t1.id;

desc t1;

desc extended t1;

desc formatted t1;

Show table partitions:

show partitions tblName;

 

 

Show a table's CREATE statement:

show create table test;

CREATE TABLE `test`(

    `line` string)

ROW FORMAT SERDE

    'org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe'

STORED AS INPUTFORMAT

  'org.apache.hadoop.mapred.TextInputFormat'

OUTPUTFORMAT

    'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'

LOCATION

    'hdfs://mini1:9000/hdfs/clean'

TBLPROPERTIES (

    'COLUMN_STATS_ACCURATE'='false',

    'numFiles'='0',

    'numRows'='-1',

    'rawDataSize'='-1',

    'totalSize'='0',

    'transient_lastDdlTime'='1490851050')

 

5.1 DDL

Rename a table:

alter table tblname rename to new_tblname;

例:

alter table t1 rename to t2;

 

Add a column to a table:

alter table tblName add columns (columnName type comment 'info');

Example:

alter table t1 add columns (col1 string);

 

 

Modify a column:

alter table table_name change

[column] col_old_name col_new_name column_type

[comment col_comment]

[first|after column_name];

e.g.:

alter table tblname change colname new_colname new_type;

Example:

alter table t1 change name stu_name string;

Move age to the first column:

alter table t1 change age age int first;

Move age after a given column:

alter table t1 change age age int after id;

 

 

Replace all columns with a specified column list:

alter table tblname replace columns (col_spec[, col_spec ...])

Example:

alter table t1 replace columns (

  t_id int comment 't_id'

);

 

Change the table's location (must be a full HDFS path):

alter table t6_external set location 'hdfs://mini1:9000/data/ttt';

 

 

 

5.2 Creating Tables

Create a table:

create table if not exists t1(id int);

Create a table using another table's schema:

create table if not exists t1 like test.t_dept;

List all tables in the current database whose names start with test:

show tables 'test.*';

Note: IN db_name and a table-name pattern cannot be used at the same time.

 

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name    -- (Note: TEMPORARY available in Hive 0.14.0 and later)

  [(col_name data_type [COMMENT col_comment], ... [constraint_specification])]

  [COMMENT table_comment]

  [PARTITIONED BY (col_name data_type [COMMENT col_comment], ...)]

  [CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name [ASC|DESC], ...)] INTO num_buckets BUCKETS]

  [SKEWED BY (col_name, col_name, ...)]                  -- (Note: Available in Hive 0.10.0 and later)

     ON ((col_value, col_value, ...), (col_value, col_value, ...), ...)

     [STORED AS DIRECTORIES]

  [

   [ROW FORMAT row_format]

   [STORED AS file_format]

     | STORED BY 'storage.handler.class.name' [WITH SERDEPROPERTIES (...)]  -- (Note: Available in Hive 0.6.0 and later)

  ]

  [LOCATION hdfs_path]

  [TBLPROPERTIES (property_name=property_value, ...)]   -- (Note: Available in Hive 0.6.0 and later)

  [AS select_statement];   -- (Note: Available in Hive 0.5.0 and later; not supported for external tables)

 

CREATE [TEMPORARY] [EXTERNAL] TABLE [IF NOT EXISTS] [db_name.]table_name

  LIKE existing_table_or_view_name

  [LOCATION hdfs_path];

 

data_type

  : primitive_type

  | array_type

  | map_type

  | struct_type

  | union_type  -- (Note: Available in Hive 0.7.0 and later)

 

primitive_type

  : TINYINT

  | SMALLINT

  | INT

  | BIGINT

  | BOOLEAN

  | FLOAT

  | DOUBLE

  | DOUBLE PRECISION -- (Note: Available in Hive 2.2.0 and later)

  | STRING

  | BINARY      -- (Note: Available in Hive 0.8.0 and later)

  | TIMESTAMP   -- (Note: Available in Hive 0.8.0 and later)

  | DECIMAL     -- (Note: Available in Hive 0.11.0 and later)

  | DECIMAL(precision, scale)  -- (Note: Available in Hive 0.13.0 and later)

  | DATE        -- (Note: Available in Hive 0.12.0 and later)

  | VARCHAR     -- (Note: Available in Hive 0.12.0 and later)

  | CHAR        -- (Note: Available in Hive 0.13.0 and later)

array_type

  : ARRAY < data_type >

map_type

  : MAP < primitive_type, data_type >

struct_type

  : STRUCT < col_name : data_type [COMMENT col_comment], ...>

union_type

   : UNIONTYPE < data_type, data_type, ... >  -- (Note: Available in Hive 0.7.0 and later)

 

row_format

  : DELIMITED [FIELDS TERMINATED BY char [ESCAPED BY char]] [COLLECTION ITEMS TERMINATED BY char]

        [MAP KEYS TERMINATED BY char] [LINES TERMINATED BY char]

        [NULL DEFINED AS char]   -- (Note: Available in Hive 0.13 and later)

  | SERDE serde_name [WITH SERDEPROPERTIES (property_name=property_value, property_name=property_value, ...)]

 

file_format:

  : SEQUENCEFILE

  | TEXTFILE    -- (Default, depending on hive.default.fileformat configuration)

  | RCFILE      -- (Note: Available in Hive 0.6.0 and later)

  | ORC         -- (Note: Available in Hive 0.11.0 and later)

  | PARQUET     -- (Note: Available in Hive 0.13.0 and later)

  | AVRO        -- (Note: Available in Hive 0.14.0 and later)

  | INPUTFORMAT input_format_classname OUTPUTFORMAT output_format_classname

 

constraint_specification:

  : [, PRIMARY KEY (col_name, ...) DISABLE NOVALIDATE ]

    [, CONSTRAINT constraint_name FOREIGN KEY (col_name, ...) REFERENCES table_name(col_name, ...) DISABLE NOVALIDATE]
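
A constraint from the grammar above in use (a minimal sketch; constraints require Hive 2.1.0 or later, and pk_demo is a hypothetical table name):

create table pk_demo (

  id int,

  name string,

  primary key (id) disable novalidate

);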

 

Example:

create table t2(

  id int comment 't2\'s ID',

  stu_name string comment 'name',

  stu_birthday date comment 'birthday',

  online boolean comment 'is online'

) row format delimited

fields terminated by '\t'

lines terminated by '\n';

 

create table t3_stu_hobby(

  id int comment 'ID',

  stu_name string comment 'stu name',

  stu_hobby array<string> comment "stu's hobby"

) row format delimited

fields terminated by '\t'

collection items terminated by ',';

 

create table t4_stu_scores(

  id int comment 'ID',

  stu_name string comment 'stu name',

  stu_scores map<string, int> comment "stu's scores"

) row format delimited

fields terminated by '\t'

collection items terminated by ','

map keys terminated by ':';

 

create table t5_staff_address (

  id int comment 'ID',

  staff_name string comment 'staff name',

  staff_address struct<home_addr:string, office_addr:string> comment "staff's address"

) row format delimited

fields terminated by '\t'

collection items terminated by ',';
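
Querying the complex-typed columns above (a minimal sketch; the array index and the 'math' map key refer to hypothetical sample data):

select stu_name, stu_hobby[0] from t3_stu_hobby;

select stu_name, stu_scores['math'] from t4_stu_scores;

select staff_name, staff_address.home_addr from t5_staff_address;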

 

create table employees (

  id int,

  salary float,

  subordinates array<string>,

  deductions map<string, float>,

  address struct<city:string,province:string,zip:int>

) row format delimited

fields terminated by '\001'

collection items terminated by '\002'

map keys terminated by '\003'

lines terminated by '\n';

 

 

create [external] table tblName (col_name data_type, ...) location 'hdfs_uri';

Example:

create external table t6_external (

  id int,

  name string,

  birthday date,

  online boolean )

row format delimited fields terminated by '\t' location '/hdfs/input/';

 

RegexSerDe

CREATE TABLE logs(

  host STRING,

  identity STRING,

  username STRING,

  time STRING,

  request STRING,

  status STRING,

  size STRING,

  referer STRING,

  agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (\\[.*\\]) \"(.*?)\" (-|[0-9]*) (-|[0-9]*) \"(.*?)\" \"(.*?)\"",

"output.format.string" = "%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s %9$s"

)

STORED AS TEXTFILE;

Example:

CREATE TABLE logs(

  host STRING,

  identity STRING,

  username STRING,

  time STRING,

  request STRING,

  status STRING,

  size STRING,

  referer STRING,

  agent STRING)

ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'

WITH SERDEPROPERTIES (

"input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) \\[(.*)\\] \"(.*?)\" (-|[0-9]*) (-|[0-9]*) \"(.*?)\" \"(.*?)\""

)

STORED AS TEXTFILE;

 

load data local inpath 'hive-demo/access.log' into table logs;

select * from logs limit 10;

 

select substring(time,0,14) datetime ,host, count(*) as count

from logs

group by substring(time,0,14), host

having count > 100;

Create Table As Select (CTAS)

CTAS has these restrictions:

The target table cannot be a partitioned table.

The target table cannot be an external table.

The target table cannot be a list bucketing table.


Example:

CREATE TABLE new_key_value_store

   ROW FORMAT SERDE "org.apache.hadoop.hive.serde2.columnar.ColumnarSerDe"

   STORED AS RCFile

   AS

SELECT (key % 1024) new_key, concat(key, value) key_value_pair

FROM key_value_store

SORT BY new_key, key_value_pair;

 

RCFile / ORC / Parquet

RCFILE combines row and column storage: data is first split into row groups, so a record is guaranteed to reside in a single block and reading one record never requires reading multiple blocks; within each row group the data is stored column-wise, which benefits compression and fast column access.

ORC is an extension and upgrade of RCFile, with a better compression ratio and higher execution efficiency.

Parquet is a newer columnar storage file format.

 

How to create:

create table t1

   stored as orc

   as

   select * from t3;
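
The same CTAS pattern works for the other columnar formats; for example, a Parquet copy (t3_parquet is a hypothetical name):

create table t3_parquet

   stored as parquet

   as

   select * from t3;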

  

Hive ACID (supported on bucketed tables stored as ORC)

create table test(id int, name string)

clustered by (id) into 3 buckets

stored as orc

tblproperties('transactional'='true');

Insert data:

insert into table test values(1, 'zhangsan'),(2, 'lisi'),(3, 'wangwu');

 

update test set name='zhaoliu' where id=1;

 

Delete: delete from test where id=1;

 

ACID concepts

  Atomicity

    Either everything completes, or nothing does.

    If an error occurs while the transaction is executing, it is rolled back to the state before the transaction began.

  Consistency

    No matter how many transactions run concurrently, the system must behave as if they ran serially.

    For example: transferring 5 yuan between accounts A and B and 15 yuan between B and E leaves the total across the three accounts unchanged.

  Isolation

    Only one request may modify or read a given piece of data at a time; each transaction is serialized.

  Durability

    Once a transaction completes, its changes are stored durably in the database.

A database that supports transactions can guarantee data correctness during transaction processing.

 

Two common ways ACID is implemented:

  Write-ahead logging

  Shadow paging

 

These settings do not take effect when set from the CLI; they must be configured in hive-site.xml:

<property>

    <name>hive.support.concurrency</name>

    <value>true</value>

    <description>

      Whether to enable concurrency control; when the ZooKeeper Hive lock manager is used, ZooKeeper must be running

    </description>

</property>

<property>

    <name>hive.txn.manager</name>

    <value>org.apache.hadoop.hive.ql.lockmgr.DbTxnManager</value>

    <description>

      Set DbTxnManager to enable transactions. The following must also be set:

      hive.compactor.initiator.on,

      hive.compactor.worker.threads,

      hive.support.concurrency (true),

      hive.enforce.bucketing (true),

      hive.exec.dynamic.partition.mode (nonstrict).

      The default DummyTxnManager does not provide transactions

    </description>

</property>

<property>

    <name>hive.enforce.bucketing</name>

    <value>true</value>

</property>

<property>

    <name>hive.compactor.initiator.on</name>

    <value>true</value>

</property>

<property>

    <name>hive.compactor.worker.threads</name>

    <value>2</value>

</property>

<property>

    <name>hive.exec.dynamic.partition.mode</name>

    <value>nonstrict</value>

</property>

<property>

  <name>hive.txn.max.open.batch</name>

  <value>1000</value>

  <description>

    Maximum number of transactions that can be obtained in one call to open_txns()

  </description>

</property>
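
After Hive is restarted with these settings, transaction support can be sanity-checked from the CLI; SHOW TRANSACTIONS lists the currently open and aborted transactions:

show transactions;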

 

hive.txn.timeout (default: 300)

  If the client sends no heartbeat within this interval, the transaction is aborted.

hive.compactor.worker.timeout (default: 86400)

  A compaction job that does not finish within this time is declared failed and is re-queued.

hive.compactor.check.interval (default: 300)

  Interval for checking whether any partition needs to be compacted.

hive.compactor.delta.num.threshold (default: 10)

  Number of delta directories that triggers a minor compaction.

hive.compactor.delta.pct.threshold (default: 0.1)

  Ratio of delta file size to base file size that triggers a major compaction (1 = 100%).

hive.compactor.abortedtxn.threshold (default: 1000)

  Number of aborted transactions in a given partition that also triggers a major compaction.

 

 

 

 

 

 

5.2.1 Partitioned Tables


Example:

create table t7_partition_1 (

  id int comment 'ID',

  name string comment 'name'

) partitioned by (create_time date comment 'create time')

row format delimited

fields terminated by '\t';

 

create table t7_partition_2 (

  id int comment 'ID',

  name string comment 'name'

) partitioned by (year int comment 'admission year',

school string comment 'school name')

row format delimited

fields terminated by '\t';

 

Show the table's partitions:

show partitions t7_partition_1;

Drop a partition:

alter table tblName drop partition(partition_spec);

alter table t7_partition_1 drop partition(create_time='2015-12-25');

Add a partition:

alter table t7_partition_1 add partition(create_time='2015-12-25');

alter table t7_partition_2 add partition(year=2014,school='呵呵');

 

Archive the data of a given partition into a HAR file:

alter table t7_partition_1 archive partition(create_time='2015-12-25');

Unarchive a partition's HAR file:

alter table t7_partition_1 unarchive partition(create_time='2015-12-25');

 

Protect a partition from being dropped or queried:

alter table t7_partition_1 partition(create_time='2015-12-25') enable no_drop;

alter table t7_partition_1 partition(create_time='2015-12-25') enable offline;

Revert:

alter table t7_partition_1 partition(create_time='2015-12-25') disable no_drop;

alter table t7_partition_1 partition(create_time='2015-12-25') disable offline;

 

 

Change the location of an existing partition:

alter table t7_partition_1 partition(create_time='2015-12-25') set location 'hdfs://mini1:9000/hdfs/sheets';

Add a partition and specify its location at the same time:

alter table t7_partition_1 add partition(create_time='2015-12-24') location 'hdfs://mini1:9000/hdfs/input';

Load data into a partition:

load data local inpath '/root/t1' into table t7_partition_1 partition(create_time='2015-12-23');

 

Require queries on partitioned tables to include a partition filter (strict mode):

set hive.mapred.mode=strict;

Non-strict mode:

set hive.mapred.mode=nonstrict;

 

 

How can a script automatically add a daily partition to the partitioned table test (partition column statDate string)?

yesterday=`date -d '1 days ago' +'%Y%m%d'`

hive -e "use default;alter table test add partition

(statdate='$yesterday') location '/data/workcopy/$yesterday';"
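
To run this daily, the snippet can be saved as a script and scheduled with cron; the path below is hypothetical:

0 1 * * * /root/add_partition.sh >> /root/add_partition.log 2>&1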

 

 

5.2.2 Bucketed Tables

CREATE TABLE page_view(viewTime INT, userid BIGINT,

     page_url STRING, referrer_url STRING,

     ip STRING COMMENT 'IP Address of the User')

 COMMENT 'This is the page view table'

 PARTITIONED BY(dt STRING, country STRING)

 CLUSTERED BY(userid) SORTED BY(viewTime) INTO 32 BUCKETS

 ROW FORMAT DELIMITED

   FIELDS TERMINATED BY '\001'

   COLLECTION ITEMS TERMINATED BY '\002'

   MAP KEYS TERMINATED BY '\003'

 STORED AS SEQUENCEFILE;

 

例:

create table t8_bucket(id int) clustered by (id) into 3 buckets;

create table sc_buck(Sno int, Cno int, Grade int)

clustered by(Sno)

sorted by(Grade DESC)

into 4 buckets

row format delimited

fields terminated by ',';

 

Enable bucketing enforcement, so that the number of reducers automatically matches the number of buckets:

set hive.enforce.bucketing=true;

Insert data into a bucketed table:

insert into table t8_bucket select * from t1 where id is not null;

insert overwrite table sc_buck

select * from sc cluster by(Sno);
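
One payoff of bucketing is cheap sampling: a single bucket can be read instead of the whole table. A minimal sketch against the t8_bucket table above:

select * from t8_bucket tablesample(bucket 1 out of 3 on id);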

 

ALTER TABLE table_name CLUSTERED BY (col_name, col_name, ...) [SORTED BY (col_name, ...)]

  INTO num_buckets BUCKETS;
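
For example (note that this changes only Hive's metadata; existing data files are not reorganized, so it affects subsequent writes only):

alter table t8_bucket clustered by (id) into 6 buckets;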

 

 

5.2.3 Archiving Table Partitions as HAR

This goes through MapReduce and is relatively slow.

mkdir $HIVE_HOME/auxlib/

cp $HADOOP_HOME/share/hadoop/tools/lib/hadoop-archives-2.6.0.jar $HIVE_HOME/auxlib/

 

 

set hive.archive.enabled=true;

alter table t7_partition_1 archive partition(create_time='2012-04-07');

Unarchive:

alter table t7_partition_1 unarchive partition(create_time='2012-04-07');

 

 

5.3 Altering Table Properties (Comment & Delimiters)

Change the table comment:

alter table t1 set tblproperties ('comment' = 'new_comment');

 

 

Converting between managed (internal) and external tables

External to managed:

  alter table external_tbl set tblproperties("EXTERNAL"="FALSE");

Then change TBL_TYPE in the corresponding row of the metastore's TBLS table to MANAGED_TABLE.

Managed to external:

  alter table managed_tbl set tblproperties("EXTERNAL"="TRUE");

Then change TBL_TYPE in the corresponding row of the metastore's TBLS table to EXTERNAL_TABLE.

 

 

Change the table's delimiters:

alter table t1 set serdeproperties('field.delim'='\t');

alter table t1 set serdeproperties('colelction.delim'='\t');

alter table t1 set serdeproperties('mapkey.delim'=':');

alter table t1 set serdeproperties('line.delim'='\n');

(Note: 'colelction.delim' above is not a typo in this document; it is the property name Hive's LazySimpleSerDe actually reads, due to a long-standing misspelling in the Hive source.)

 

TRUNCATE TABLE table_name [PARTITION partition_spec];

partition_spec:

  : (partition_column = partition_col_value, partition_column = partition_col_value, ...)

Example:

truncate table t1;

 

Primary and foreign keys:

ALTER TABLE table_name ADD CONSTRAINT constraint_name PRIMARY KEY (column, ...) DISABLE NOVALIDATE;

ALTER TABLE table_name ADD CONSTRAINT constraint_name FOREIGN KEY (column, ...) REFERENCES table_name(column, ...) DISABLE NOVALIDATE RELY;

ALTER TABLE table_name DROP CONSTRAINT constraint_name;
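
For example, adding a hypothetical primary key named t2_pk to the t2 table created earlier (constraints require Hive 2.1.0 or later):

alter table t2 add constraint t2_pk primary key (id) disable novalidate;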

 

 

 

5.4 Views

CREATE VIEW [IF NOT EXISTS] [db_name.]view_name [(column_name [COMMENT column_comment], ...) ]

  [COMMENT view_comment]

  [TBLPROPERTIES (property_name = property_value, ...)]

  AS SELECT ...;

Example:

CREATE VIEW onion_referrers(url COMMENT 'URL of Referring page')

  COMMENT 'Referrers to The Onion website'

  AS

  SELECT DISTINCT referrer_url

  FROM page_view

  WHERE page_url='http://www.theonion.com';

Example:

create view t9_view as select * from t2;

 

DROP VIEW [IF EXISTS] [db_name.]view_name;

Example:

DROP VIEW onion_referrers;

 

ALTER VIEW [db_name.]view_name SET TBLPROPERTIES table_properties;

 

table_properties:

  : (property_name = property_value, property_name = property_value, ...)

 

Redefine a view:

ALTER VIEW [db_name.]view_name AS select_statement;
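
For example, redefining the t9_view created above to expose only two columns of t2 (a minimal sketch):

alter view t9_view as select id, stu_name from t2;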

 

5.5 Indexes

Indexes can speed up queries that contain GROUP BY.

hive.optimize.index.filter: automatically use indexes

hive.optimize.index.groupby: use aggregate indexes to optimize GROUP BY
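
A minimal sketch of turning these on for a session (they can also be set in hive-site.xml):

set hive.optimize.index.filter=true;

set hive.optimize.index.groupby=true;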

 

 

CREATE INDEX index_name

  ON TABLE base_table_name (col_name, ...)

  AS index_type

  [WITH DEFERRED REBUILD]

  [IDXPROPERTIES (property_name=property_value, ...)]

  [IN TABLE index_table_name]

  [

     [ ROW FORMAT ...] STORED AS ...

     | STORED BY ...

  ]

  [LOCATION hdfs_path]

  [TBLPROPERTIES (...)]

  [COMMENT "index comment"];

 

Create an index:

create index t1_index on table t1(id) as

'org.apache.hadoop.hive.ql.index.compact.CompactIndexHandler'

with deferred rebuild in table t1_index_table;

AS specifies the index handler class.

Rebuild an index:

ALTER INDEX index_name ON table_name [PARTITION partition_spec] REBUILD;

Example:

alter index t1_index on t1 rebuild;

Show indexes:

show formatted index on t1;

Drop an index:

DROP INDEX [IF EXISTS] index_name ON table_name;

Example:

drop index if exists t1_index on t1;

 

5.6 SHOW

SHOW (DATABASES|SCHEMAS) [LIKE 'identifier_with_wildcards'];

 

SHOW TABLES [IN database_name] ['identifier_with_wildcards'];

 

SHOW VIEWS [IN/FROM database_name] [LIKE 'pattern_with_wildcards'];

