业务场景中Hive解析Json常用案例
json在线工具
- json格式转换在线工具
https://tool.lu/json/
- format格式互转:
// 格式化可以合并整行显示
{"name":"John Doe","age":35,"email":"john@example.com"}// 格式化可以展开显示,示例数据更清晰
{"name": "John Doe","age": 35,"email": "john@example.com"
}
- 转化常用yaml格式
工具提供了各种跨格式转化
age: 35
email: john@example.com
name: 'John Doe'
根据json串自动生成生成 Java 类等功能也非常实用
示例准备
需要Hive环境
- test测试表
-- 在Hive中建测试表
CREATE TABLE json_test_table (id INT,json_column STRING
) STORED AS TEXTFILE;
json_column 字段用来存储需要处理的 json 串
示例 1: 提取单个字段
假设 JSON 数据包含用户信息
{"name": "John Doe","age": 35,"email": "john@example.com"
}
- 样例数据
insert into json_test_table values (1,'{"name":"John Doe","age":35,"email":"john@example.com"}');select * from json_test_table where id = 1;
+------+---------------------------------------------------------+
| id | json_column |
+------+---------------------------------------------------------+
| 1 | {"name":"John Doe","age":35,"email":"john@example.com"} |
+------+---------------------------------------------------------+
使用 get_json_object
函数从 JSON 中提取字段:
SELECT get_json_object(json_column, '$.name') AS name FROM json_test_table where id = 1;+-----------+
| name |
+-----------+
| John Doe |
+-----------+
1 row selected (19.655 seconds)SELECT get_json_object(json_column, '$.name') AS name,get_json_object(json_column, '$.age') AS age FROM json_test_table where id = 1;
+-----------+------+
| name | age |
+-----------+------+
| John Doe | 35 |
+-----------+------+
1 row selected (19.396 seconds)
示例 2: 提取嵌套字段
在 JSON 数据中有嵌套字段的情况下:
{"user": {"name": "Alice","address": {"city": "New York","zipcode": "10001"}}
}
- 样例数据
insert into json_test_table values (2,'{"user":{"name":"Alice","address":{"city":"New York","zipcode":"10001"}}}');select * from json_test_table where id = 2;
+------+---------------------------------------------------------------------------+
| id | json_column |
+------+---------------------------------------------------------------------------+
| 2 | {"user":{"name":"Alice","address":{"city":"New York","zipcode":"10001"}}} |
+------+---------------------------------------------------------------------------+
提取嵌套字段的值:
SELECT get_json_object(json_column, '$.user.name') AS user_name,get_json_object(json_column, '$.user.address.city') AS city
FROM json_test_table where id = 2;+------------+-----------+
| user_name | city |
+------------+-----------+
| Alice | New York |
+------------+-----------+
1 row selected (19.38 seconds)
示例 3: 提取数组中的值
假设 JSON 数据包含一个数组:
{"tags": ["apple", "banana", "orange"]
}
- 样例数据
insert into json_test_table values (3,'{"tags":["apple","banana","orange"]}');select * from json_test_table where id = 3;
+------+--------------------------------------+
| id | json_column |
+------+--------------------------------------+
| 3 | {"tags":["apple","banana","orange"]} |
+------+--------------------------------------+
从数组中提取值:
SELECT get_json_object(json_column, '$.tags[0]') AS first_tag,get_json_object(json_column, '$.tags[1]') AS second_tag
FROM json_test_table where id = 3;+------------+-------------+
| first_tag | second_tag |
+------------+-------------+
| apple | banana |
+------------+-------------+
1 row selected (18.488 seconds)
示例 4: 提取数组列表长度
获取数组的长度:
-- 复用 示例3中样例数据
select * from json_test_table where id = 3;
+------+--------------------------------------+
| id | json_column |
+------+--------------------------------------+
| 3 | {"tags":["apple","banana","orange"]} |
+------+--------------------------------------+
-- 可以看到数组tags对应的值有3个SELECT size(split(get_json_object(json_column, '$.tags'), ',')) AS tags_length
FROM json_test_table
WHERE id = 3;+--------------+
| tags_length |
+--------------+
| 3 |
+--------------+
1 row selected (35.766 seconds)
示例 5:业务场景-解析公司基本信息案例
- 假设JSON数据包含公司的基本信息,比如公司名称、注册资本等字段
{"company_name": "wangting_company","registration": {"registered_capital": 1000000,"registered_date": "2020-01-01","registered_address": "123 Main St, City"}
}{"company_name":"wangting_company","registration":{"registered_capital":1000000,"registered_date":"2020-01-01","registered_address":"123 Main St, City"}}
-- 创建表
CREATE TABLE company_info (company_name STRING,registered_capital INT,registered_date STRING,registered_address STRING
) STORED AS TEXTFILE;-- 转化插入数据
INSERT INTO company_info
SELECTget_json_object(json_data, '$.company_name') AS company_name,get_json_object(json_data, '$.registration.registered_capital') AS registered_capital,get_json_object(json_data, '$.registration.registered_date') AS registered_date,get_json_object(json_data, '$.registration.registered_address') AS registered_address
FROM(SELECT '{"company_name": "wangting_company", "registration": {"registered_capital": 1000000, "registered_date": "2020-01-01", "registered_address": "123 Main St, City"}}' AS json_data) t;+----------------------------+----------------------------------+-------------------------------+----------------------------------+
| company_info.company_name | company_info.registered_capital | company_info.registered_date | company_info.registered_address |
+----------------------------+----------------------------------+-------------------------------+----------------------------------+
| wangting_company | 1000000 | 2020-01-01 | 123 Main St, City |
+----------------------------+----------------------------------+-------------------------------+----------------------------------+
1 row selected (0.411 seconds)
示例 6: 使用 LATERAL VIEW 解析数组
使用 LATERAL VIEW
和 explode
解析 JSON 数组:
-- 复用示例3样例数据
SELECT id, tag
FROM json_test_table
LATERAL VIEW explode(split(get_json_object(json_column, '$.tags'), ',')) exploded_tags AS tag
WHERE id = 3;+-----+------------+
| id | tag |
+-----+------------+
| 3 | ["apple" |
| 3 | "banana" |
| 3 | "orange"] |
+-----+------------+
3 rows selected (17.318 seconds)-- 仅展示解析数据,实际使用如需去除方括号,可以一并清洗
示例 7: 提取数组对象的值
如果数组包含对象,则提取对象的值:
{"users": [{"name": "Alice","age": 28},{"name": "Bob","age": 35}]
}{"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]}
- 样例数据
insert into json_test_table values (7,'{"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]}');select * from json_test_table where id = 7;
+------+---------------------------------------------------------------+
| id | json_column |
+------+---------------------------------------------------------------+
| 7 | {"users":[{"name":"Alice","age":28},{"name":"Bob","age":35}]} |
+------+---------------------------------------------------------------+
SELECT get_json_object(json_column, '$.users[0].name') AS user1_name,get_json_object(json_column, '$.users[1].name') AS user2_name
FROM json_test_table where id = 7;+-------------+-------------+
| user1_name | user2_name |
+-------------+-------------+
| Alice | Bob |
+-------------+-------------+
1 row selected (17.372 seconds)
示例 8: 处理日期
处理 JSON 中的日期字段:
{"event_date": "2023-11-21"
}{"event_date":"2023-11-21"}
- 样例数据
insert into json_test_table values (8,'{"event_date":"2023-11-21"}');select * from json_test_table where id = 8;
+------+-----------------------------+
| id | json_column |
+------+-----------------------------+
| 8 | {"event_date":"2023-11-21"} |
+------+-----------------------------+
SELECT get_json_object(json_column, '$.event_date') AS event_date
FROM json_test_table where id = 8;+-------------+
| event_date |
+-------------+
| 2023-11-21 |
+-------------+
1 row selected (17.436 seconds)SELECT id, CAST(get_json_object(json_column, '$.event_date') AS DATE) AS event_date,YEAR(CAST(get_json_object(json_column, '$.event_date') AS DATE)) AS event_year,MONTH(CAST(get_json_object(json_column, '$.event_date') AS DATE)) AS event_month
FROM json_test_table
WHERE id = 8;+-----+-------------+-------------+--------------+
| id | event_date | event_year | event_month |
+-----+-------------+-------------+--------------+
| 8 | 2023-11-21 | 2023 | 11 |
+-----+-------------+-------------+--------------+
1 row selected (17.363 seconds)
示例 9: 提取布尔值
提取 JSON 中的布尔字段:
{"is_active": true
}{"is_active":true}
- 样例数据
insert into json_test_table values (9,'{"is_active":true}');select * from json_test_table where id = 9;
+------+--------------------+
| id | json_column |
+------+--------------------+
| 9 | {"is_active":true} |
+------+--------------------+
SELECT get_json_object(json_column, '$.is_active') AS is_active
FROM json_test_table where id = 9;
+------------+
| is_active |
+------------+
| true |
+------------+
1 row selected (18.401 seconds)-- 使用WHERE子句根据布尔值进行条件过滤。
SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.is_active') = 'true';+---------------------+------------------------------+
| json_test_table.id | json_test_table.json_column |
+---------------------+------------------------------+
| 9 | {"is_active":true} |
+---------------------+------------------------------+
1 row selected (17.327 seconds)
示例 10: json_tuple计算 JSON 对象数量
计算 JSON 对象中包含的键值对数量:
SELECT json_tuple(json_column, 'name', 'age') AS (name, age)
FROM json_test_table;+-----------+-------+
| name | age |
+-----------+-------+
| John Doe | 35 |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
| NULL | NULL |
+-----------+-------+
6 rows selected (17.344 seconds)-- 只有id=1符合条件 不符合条件均为NULL
示例 11: 使用 json_tuple
提取值
从 JSON 对象中使用 json_tuple
提取多个字段:
{"name": "John","age": 30,"address": "New York"
}{"name":"John","age":30,"address":"New York"}
- 样例数据
insert into json_test_table values (11,'{"name":"John","age":30,"address":"New York"}');select * from json_test_table where id = 11;
+------+-----------------------------------------------+
| id | json_column |
+------+-----------------------------------------------+
| 11 | {"name":"John","age":30,"address":"New York"} |
+------+-----------------------------------------------+
SELECT json_tuple(json_column, 'name', 'age', 'address') AS (name, age, address)
FROM json_test_table where id = 11;+-------+------+-----------+
| name | age | address |
+-------+------+-----------+
| John | 30 | New York |
+-------+------+-----------+
1 row selected (17.414 seconds)
示例 12: 提取数字字段
从 JSON 中提取数字字段:
{"count": 50
}{"count":50}
- 样例数据
insert into json_test_table values (12,'{"count":50}');select * from json_test_table where id = 12;
+------+--------------+
| id | json_column |
+------+--------------+
| 12 | {"count":50} |
+------+--------------+
SELECT get_json_object(json_column, '$.count') AS count
FROM json_test_table where id = 12;+--------+
| count |
+--------+
| 50 |
+--------+
1 row selected (19.401 seconds)
示例 13: 使用 get_json_object
过滤数据
根据 JSON 中的字段值进行过滤:
SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.age') > 30;+---------------------+----------------------------------------------------+
| json_test_table.id | json_test_table.json_column |
+---------------------+----------------------------------------------------+
| 1 | {"name":"John Doe","age":35,"email":"john@example.com"} |
+---------------------+----------------------------------------------------+
1 row selected (18.402 seconds)SELECT *
FROM json_test_table
WHERE get_json_object(json_column, '$.age') > 10;+---------------------+----------------------------------------------------+
| json_test_table.id | json_test_table.json_column |
+---------------------+----------------------------------------------------+
| 1 | {"name":"John Doe","age":35,"email":"john@example.com"} |
| 11 | {"name":"John","age":30,"address":"New York"} |
+---------------------+----------------------------------------------------+
2 rows selected (17.298 seconds)
示例 14: 处理空值
处理 JSON 数据中可能的空值:
{"status": null
}{"status":null}
- 样例数据
insert into json_test_table values (14,'{"status":null}');select * from json_test_table where id = 14;
+------+-----------------+
| id | json_column |
+------+-----------------+
| 14 | {"status":null} |
+------+-----------------+
SELECT get_json_object(json_column, '$.status') AS status
FROM json_test_table where id = 14;+---------+
| status |
+---------+
| NULL |
+---------+
1 row selected (17.345 seconds)-- 使用IS NULL或IS NOT NULL来检查字段是否为空
SELECT id, json_column
FROM json_test_table
WHERE id = 14 AND get_json_object(json_column, '$.status') IS NULL;+-----+------------------+
| id | json_column |
+-----+------------------+
| 14 | {"status":null} |
+-----+------------------+
1 row selected (17.47 seconds)-- 使用CASE语句对NULL值进行处理或者转换为其他值。
SELECT id, CASE WHEN get_json_object(json_column, '$.status') IS NULL THEN 'No Status' ELSE get_json_object(json_column, '$.status') END AS status
FROM json_test_table
WHERE id = 14;+-----+------------+
| id | status |
+-----+------------+
| 14 | No Status |
+-----+------------+
1 row selected (17.31 seconds)
示例 15: 使用 get_json_object
判断是否存在字段
判断 JSON 是否包含特定字段:
SELECT CASE WHEN get_json_object(json_column, '$.name') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;
+---------------+
| field_status |
+---------------+
| Exists |
| Not Exists |
| Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
+---------------+SELECT CASE WHEN get_json_object(json_column, '$.age') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;+---------------+
| field_status |
+---------------+
| Exists |
| Not Exists |
| Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
+---------------+SELECT CASE WHEN get_json_object(json_column, '$.aaaaaaaaa') IS NOT NULL THEN 'Exists' ELSE 'Not Exists' END AS field_status
FROM json_test_table;+---------------+
| field_status |
+---------------+
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
| Not Exists |
+---------------+
9 rows selected (17.337 seconds)
示例 16: 提取多层嵌套对象的值
从多层嵌套的 JSON 对象中提取值:
{"user": {"details": {"name": "Alice","age": 30}}
}{"user":{"details":{"name":"Alice","age":30}}}
- 样例数据
insert into json_test_table values (16,'{"user":{"details":{"name":"Alice","age":30}}}');select * from json_test_table where id = 16;
+------+------------------------------------------------+
| id | json_column |
+------+------------------------------------------------+
| 16 | {"user":{"details":{"name":"Alice","age":30}}} |
+------+------------------------------------------------+
SELECT get_json_object(json_column, '$.user.details.name') AS user_name,get_json_object(json_column, '$.user.details.age') AS user_age
FROM json_test_table where id = 16;+------------+-----------+
| user_name | user_age |
+------------+-----------+
| Alice | 30 |
+------------+-----------+
1 row selected (16.253 seconds)
示例 17: 提取嵌套对象数组的值
从嵌套的对象数组中提取值:
{"employees": [{"name": "Alice","department": "HR"},{"name": "Bob","department": "Engineering"}]
}{"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]}
- 样例数据
insert into json_test_table values (17,'{"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]}');select * from json_test_table where id = 17;
+------+----------------------------------------------------------------------------------------------+
| id | json_column |
+------+----------------------------------------------------------------------------------------------+
| 17 | {"employees":[{"name":"Alice","department":"HR"},{"name":"Bob","department":"Engineering"}]} |
+------+----------------------------------------------------------------------------------------------+
SELECT get_json_object(json_column, '$.employees[0].name') AS employee1_name,get_json_object(json_column, '$.employees[1].name') AS employee2_name
FROM json_test_table where id = 17;+-----------------+-----------------+
| employee1_name | employee2_name |
+-----------------+-----------------+
| Alice | Bob |
+-----------------+-----------------+
1 row selected (17.273 seconds)
示例 18: 提取数组对象的多个值
取数组对象的多个值
从数组对象中提取多个值:
{"items": [{"id": 1,"name": "Item 1"},{"id": 2,"name": "Item 2"}]
}{"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]}
- 样例数据
insert into json_test_table values (18,'{"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]}');select * from json_test_table where id = 18;
+------+---------------------------------------------------------------+
| id | json_column |
+------+---------------------------------------------------------------+
| 18 | {"items":[{"id":1,"name":"Item 1"},{"id":2,"name":"Item 2"}]} |
+------+---------------------------------------------------------------+
SELECT get_json_object(json_column, '$.items[0].id') AS item1_id,get_json_object(json_column, '$.items[0].name') AS item1_name,get_json_object(json_column, '$.items[1].id') AS item2_id,get_json_object(json_column, '$.items[1].name') AS item2_name
FROM json_test_table where id = 18;+-----------+-------------+-----------+-------------+
| item1_id | item1_name | item2_id | item2_name |
+-----------+-------------+-----------+-------------+
| 1 | Item 1 | 2 | Item 2 |
+-----------+-------------+-----------+-------------+
1 row selected (17.386 seconds)
示例19: json_serde解析映射数据文件的使用介绍
json_serde
库提供了一种在Hive中直接将JSON格式的数据解析成表的方式,可简化处理流程。
如果现在有一个json文件:
[app@ali-wangting wangt]$ cat wangt.json
{"name":"wangting01","age":21,"email":"wangting01@example.com"}
{"name":"wangting02","age":22,"email":"wangting02@example.com"}
{"name":"wangting03","age":23,"email":"wangting03@example.com"}
{"name":"wangting04","age":24,"email":"wangting04@example.com"}
{"name":"wangting05","age":25,"email":"wangting05@example.com"}
{"name":"wangting06","age":26,"email":"wangting06@example.com"}
{"name":"wangting07","age":27,"email":"wangting07@example.com"}
{"name":"wangting08","age":28,"email":"wangting08@example.com"}
现在希望建立一张表,使得表可以直接映射上数据文件
- hive中建表
CREATE TABLE IF NOT EXISTS wangt_666_json(name string,age int,email string
)
ROW FORMAT SERDE 'org.apache.hive.hcatalog.data.JsonSerDe'
stored as textfile;desc wangt_json_table;
+-----------+------------+--------------------+
| col_name | data_type | comment |
+-----------+------------+--------------------+
| name | string | from deserializer |
| age | int | from deserializer |
| email | string | from deserializer |
+-----------+------------+--------------------+
- 将json文件上传至建表对应hdfs路径
[app@ali-wangting wangt]$ hdfs dfs -ls /user/hive/warehouse/wangt.db/wangt_666_json/
[app@ali-wangting wangt]$ hdfs dfs -put wangt.json /user/hive/warehouse/wangt.db/wangt_666_json/
[app@ali-wangting wangt]$ hdfs dfs -ls /user/hive/warehouse/wangt.db/wangt_666_json/
Found 1 items
-rwxrwx--x+ 2 hive hive 512 2023-12-04 16:38 /user/hive/warehouse/wangt.db/wangt_666_json/wangt.json# 也可以通过hive的sql命令行直接load数据文件
load data local inpath '/home/wangt/wangt.json' into table wangt_666_json;
- 查看表数据验证
select * from wangt_666_json;
+----------------------+---------------------+-------------------------+
| wangt_666_json.name | wangt_666_json.age | wangt_666_json.email |
+----------------------+---------------------+-------------------------+
| wangting01 | 21 | wangting01@example.com |
| wangting02 | 22 | wangting02@example.com |
| wangting03 | 23 | wangting03@example.com |
| wangting04 | 24 | wangting04@example.com |
| wangting05 | 25 | wangting05@example.com |
| wangting06 | 26 | wangting06@example.com |
| wangting07 | 27 | wangting07@example.com |
| wangting08 | 28 | wangting08@example.com |
+----------------------+---------------------+-------------------------+
8 rows selected (0.5 seconds)
可以看到数据文件8行记录,成功映射对应了hive中wangt_666_json表的8条数据