drill apache_如何指南:Apache Drill入门

drill apache

Apache Drill是一个引擎,可以连接到许多不同的数据源,并为它们提供SQL接口。 它不仅是遍历任何复杂事物SQL界面,而且是功能强大的界面, 其中包括对许多内置函数和窗口函数的支持。 尽管它可以连接到可以使用SQL进行查询的标准数据源(例如Oracle或MySQL),但它还可以处理诸如CSV或JSON等平面文件以及Avro和Parquet格式。 正是这种对文件运行SQL的能力首先激发了我对Apache Drill的兴趣。 我花了大量时间研究大数据架构和工具,包括大数据发现 。 作为此过程的一部分,尝试数据管道选项 ,我发现的差距之一是在将它们引入Hive之类的东西之前,可以以原始状态挖掘文件的功能,从而可以通过BDD和其他工具。

在本文中,我将逐步介绍Apache Drill的入门知识,并展示一些我认为是其有用性的很好例子的查询类型。

入门

开始使用Apache Drill非常简单–只需下载并解压缩并运行即可。 虽然它可以跨机器分布运行以提高性能,但也可以在笔记本电脑上独立运行。

启动它

cd /opt/apache-drill-1.7.0/
bin/sqlline -u jdbc:drill:zk=local

如果您No current connectioncom.fasterxml.jackson.databind.JavaType.isReferenceType()Z则您遇到冲突的JAR问题 (例如,我在Oracle BigDataLite VM上遇到此问题 ),应在干净的环境下启动它

env -i HOME="$HOME" LC_CTYPE="${LC_ALL:-${LC_CTYPE:-$LANG}}" PATH="$PATH" USER="$USER" /opt/apache-drill-1.7.0/bin/drill-embedded

有一个内置数据集可用于测试:

USE cp;
SELECT employee_id, first_name FROM `employee.json` limit 5;

如果您习惯使用SQL * Plus和类似的工具,那么在非常熟悉的环境中,这应该返回五行:

0: jdbc:drill:zk=local> USE cp;
+-------+---------------------------------+
|  ok   |             summary             |
+-------+---------------------------------+
| true  | Default schema changed to [cp]  |
+-------+---------------------------------+
1 row selected (1.776 seconds)
0: jdbc:drill:zk=local>     SELECT employee_id, first_name FROM `employee.json` limit 5;
+--------------+-------------+
| employee_id  | first_name  |
+--------------+-------------+
| 1            | Sheri       |
| 2            | Derrick     |
| 4            | Michael     |
| 5            | Maya        |
| 6            | Roberta     |
+--------------+-------------+
5 rows selected (3.624 seconds)

到目前为止,是如此SQL,如此的关系-如此熟悉,真的。 Apache Drill开始偏离明显之处的是它对存储处理程序的使用。 在上面的查询中,cp是我们要对其运行查询的“数据库”,但实际上这是默认情况下定义的“类路径”(因此称为“ cp”)存储处理程序。 在“数据库”中,存在“方案”,它们是存储处理程序的子配置。 稍后我们将查看和定义这些内容。 现在,知道您也可以列出可用的数据库很有用:

0: jdbc:drill:zk=local> show databases;
+---------------------+
|     SCHEMA_NAME     |
+---------------------+
| INFORMATION_SCHEMA  |
| cp.default          |
| dfs.default         |
| dfs.root            |
| dfs.tmp             |
| sys                 |
+---------------------+

注意databases命令是schemas的同义词; 两者都返回的是<database>.<schema> 。 在Apache Drill中,反引号用于包围标识符(例如模式名称,列名称等),并且它非常具体。 例如,这是有效的:

0: jdbc:drill:zk=local> USE `cp.default`;
+-------+-----------------------------------------+
|  ok   |                 summary                 |
+-------+-----------------------------------------+
| true  | Default schema changed to [cp.default]  |
+-------+-----------------------------------------+
1 row selected (0.171 seconds)

虽然不是:

0: jdbc:drill:zk=local> USE cp.default;
Error: PARSE ERROR: Encountered ". default" at line 1, column 7.
Was expecting one of:
<EOF>
"." <IDENTIFIER> ...
"." <QUOTED_IDENTIFIER> ...
"." <BACK_QUOTED_IDENTIFIER> ...
"." <BRACKET_QUOTED_IDENTIFIER> ...
"." <UNICODE_QUOTED_IDENTIFIER> ...
"." "*" ...SQL Query USE cp.default

这是因为default是保留字,因此必须加引号。 因此,您也可以使用

0: jdbc:drill:zk=local> use cp.`default`;

但不是

0: jdbc:drill:zk=local> use `cp`.default;

查询JSON数据

在Apache Drill网站上有一些有用的教程 ,其中包括一个使用Yelp提供的数据的教程 。 这是最初让我着眼于Drill的数据集,因为我将其用作大数据发现 (BDD)的输入,但在两个方面都遇到了困难。 首先是如何最好地在其上定义合适的Hive表,以便将其提取到BDD中。 接下来,我们试图了解数据中可能包含的价值,这将花费多长时间来完善我在Hive中公开数据的方式。 以下示例显示了以表格形式查询时,复杂JSON可能带来的复杂性。

首先,查询JSON文件,并自动推断出架构。 很酷

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` limit 5;
+---------+------+-------------+-------+------+------+
| user_id | text | business_id | likes | date | type |
+---------+------+-------------+-------+------+------+
| -6rEfobYjMxpUWLNxszaxQ | Don't waste your time. | cE27W9VPgO88Qxe4ol6y_g | 0 | 2013-04-18 | tip |
| EZ0r9dKKtEGVx2CdnowPCw | Your GPS will not allow you to find this place. Put Rankin police department in instead. They are directly across the street. | mVHrayjG3uZ_RLHkLj-AMg | 1 | 2013-01-06 | tip |
| xb6zEQCw9I-Gl0g06e1KsQ | Great drink specials! | KayYbHCt-RkbGcPdGOThNg | 0 | 2013-12-03 | tip |
| QawZN4PSW7ng_9SP7pjsVQ | Friendly staff, good food, great beer selection, and relaxing atmosphere | KayYbHCt-RkbGcPdGOThNg | 0 | 2015-07-08 | tip |
| MLQre1nvUtW-RqMTc4iC9A | Beautiful restoration. | 1_lU0-eSWJCRvNGk78Zh9Q | 0 | 2015-10-25 | tip |
+---------+------+-------------+-------+------+------+
5 rows selected (2.341 seconds)

我们可以使用标准SQL聚合,例如COUNT

0: jdbc:drill:zk=local> select count(*) from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json`;
+---------+
| EXPR$0  |
+---------+
| 591864  |
+---------+
1 row selected (4.495 seconds)

以及GROUP BY操作:

0: jdbc:drill:zk=local> select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;
+-------------+------------+
|    date     | tip_count  |
+-------------+------------+
| 2012-07-21  | 719        |
| 2012-05-19  | 718        |
| 2012-08-04  | 699        |
| 2012-06-23  | 690        |
| 2012-07-28  | 682        |
+-------------+------------+
5 rows selected (7.111 seconds)

稍微研究一下数据,我们可以看到它并不完全平坦–注意,例如, hours列,它是一个嵌套的JSON对象:

0: jdbc:drill:zk=local> select full_address,city,hours from `/user/oracle/incoming/yelp/business_json` b limit 5;
+--------------+------+-------+
| full_address | city | hours |
+--------------+------+-------+
| 4734 Lebanon Church Rd
Dravosburg, PA 15034 | Dravosburg | {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
| 202 McClure St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1 Ravine St
Dravosburg, PA 15034 | Dravosburg | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 1530 Hamilton Rd
Bethel Park, PA 15234 | Bethel Park | {"Friday":{},"Tuesday":{},"Thursday":{},"Wednesday":{},"Monday":{},"Sunday":{},"Saturday":{}} |
| 301 South Hills Village
Pittsburgh, PA 15241 | Pittsburgh | {"Friday":{"close":"17:00","open":"10:00"},"Tuesday":{"close":"21:00","open":"10:00"},"Thursday":{"close":"17:00","open":"10:00"},"Wednesday":{"close":"21:00","open":"10:00"},"Monday":{"close":"21:00","open":"10:00"},"Sunday":{"close":"18:00","open":"11:00"},"Saturday":{"close":"21:00","open":"10:00"}} |
+--------------+------+-------+
5 rows selected (0.721 seconds)
0: jdbc:drill:zk=local>

使用Apache Drill,我们可以简单地使用点表示法来访问嵌套值。 在执行此操作时,必须为表加上别名(在本示例中为b ):

0: jdbc:drill:zk=local> select b.hours from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-------+
| hours |
+-------+
| {"Friday":{"close":"21:00","open":"11:00"},"Tuesday":{"close":"21:00","open":"11:00"},"Thursday":{"close":"21:00","open":"11:00"},"Wednesday":{"close":"21:00","open":"11:00"},"Monday":{"close":"21:00","open":"11:00"},"Sunday":{},"Saturday":{}} |
+-------+

嵌套对象本身也可以嵌套-Apache Drill并不存在问题,我们只是将点符号进一步链接起来:

0: jdbc:drill:zk=local> select b.hours.Friday from `/user/oracle/incoming/yelp/business_json` b limit 1;
+-----------------------------------+
|              EXPR$0               |
+-----------------------------------+
| {"close":"21:00","open":"11:00"}  |
+-----------------------------------+
1 row selected (0.238 seconds)

请注意使用反引号( ` )引用保留的openclose关键字:

0: jdbc:drill:zk=local> select b.hours.Friday.`open`,b.hours.Friday.`close` from `/user/oracle/incoming/yelp/business_json` b limit 1;
+---------+---------+
| EXPR$0  | EXPR$1  |
+---------+---------+
| 11:00   | 21:00   |
+---------+---------+
1 row selected (0.58 seconds)

嵌套列本身就是查询中的适当对象,也可以用作谓词:

0: jdbc:drill:zk=local> select b.name,b.full_address,b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
|          name          |                  full_address                  | EXPR$2  |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie              | 4734 Lebanon Church Rd
Dravosburg, PA 15034    | 11:00   |
| Alexion's Bar & Grill  | 141 Hawthorne St
Greentree
Carnegie, PA 15106  | 11:00   |
| Rocky's Lounge         | 1201 Washington Ave
Carnegie, PA 15106         | 11:00   |
| Papa J's               | 200 E Main St
Carnegie
Carnegie, PA 15106      | 11:00   |
| Italian Village Pizza  | 2615 Main St
Homestead, PA 15120               | 11:00   |
+------------------------+------------------------------------------------+---------+
5 rows selected (0.404 seconds)

您会在上面的输出中注意到, full_address字段中包含换行符—我们可以使用SQL函数用逗号替换换行符:

0: jdbc:drill:zk=local> select b.name,regexp_replace(b.full_address,'\n',','),b.hours.Friday.`open` from `/user/oracle/incoming/yelp/business_json` b where b.hours.Friday.`open` = '11:00' limit 5;
+------------------------+------------------------------------------------+---------+
|          name          |                     EXPR$1                     | EXPR$2  |
+------------------------+------------------------------------------------+---------+
| Mr Hoagie              | 4734 Lebanon Church Rd,Dravosburg, PA 15034    | 11:00   |
| Alexion's Bar & Grill  | 141 Hawthorne St,Greentree,Carnegie, PA 15106  | 11:00   |
| Rocky's Lounge         | 1201 Washington Ave,Carnegie, PA 15106         | 11:00   |
| Papa J's               | 200 E Main St,Carnegie,Carnegie, PA 15106      | 11:00   |
| Italian Village Pizza  | 2615 Main St,Homestead, PA 15120               | 11:00   |
+------------------------+------------------------------------------------+---------+
5 rows selected (1.346 seconds)

查询联合

因此,Apache Drill使您能够对多种格式和位置的数据运行SQL查询,这本身就非常有用。 但比这更好的是,它使您可以在单个查询中联合这些源。 这是在HDFS和Oracle中的数据之间进行联接的示例:

0: jdbc:drill:zk=local> select X.text,
. . . . . . . . . . . > Y.NAME
. . . . . . . . . . . > from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X
. . . . . . . . . . . > inner join ora.MOVIEDEMO.YELP_BUSINESS Y
. . . . . . . . . . . > on X.business_id = Y.BUSINESS_ID
. . . . . . . . . . . > where Y.NAME = 'Chick-fil-A'
. . . . . . . . . . . > limit 5;
+--------------------------------------------------------------------+--------------+
|                                text                                |     NAME     |
+--------------------------------------------------------------------+--------------+
| It's daddy daughter date night here and they go ALL OUT!           | Chick-fil-A  |
| Chicken minis!  The best part of waking up Saturday mornings.  :)  | Chick-fil-A  |
| Nice folks as always unlike those ghetto joints                    | Chick-fil-A  |
| Great clean and delicious chicken sandwiches!                      | Chick-fil-A  |
| Spicy Chicken with lettuce, tomato, and pepperjack cheese FTW!     | Chick-fil-A  |
+--------------------------------------------------------------------+--------------+
5 rows selected (3.234 seconds)

您可以为此定义一个视图:

0: jdbc:drill:zk=local> create or replace view dfs.tmp.yelp_tips as select X.text as tip_text, Y.NAME as business_name from hdfs.`/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` X inner join ora.MOVIEDEMO.YELP_BUSINESS Y on X.business_id = Y.BUSINESS_ID ;
+-------+-------------------------------------------------------------+
|  ok   |                           summary                           |
+-------+-------------------------------------------------------------+
| true  | View 'yelp_tips' replaced successfully in 'dfs.tmp' schema  |
+-------+-------------------------------------------------------------+
1 row selected (0.574 seconds)
0: jdbc:drill:zk=local> describe dfs.tmp.yelp_tips;
+----------------+--------------------+--------------+
|  COLUMN_NAME   |     DATA_TYPE      | IS_NULLABLE  |
+----------------+--------------------+--------------+
| tip_text       | ANY                | YES          |
| business_name  | CHARACTER VARYING  | YES          |
+----------------+--------------------+--------------+
2 rows selected (0.756 seconds)

然后将其作为任何常规对象进行查询:

0: jdbc:drill:zk=local> select tip_text,business_name from dfs.tmp.yelp_tips where business_name like '%Grill' limit 5;
+------+------+
| text | NAME |
+------+------+
| Great drink specials! | Alexion's Bar & Grill |
| Friendly staff, good food, great beer selection, and relaxing atmosphere | Alexion's Bar & Grill |
| Pretty quiet here... | Uno Pizzeria & Grill |
| I recommend this location for quick lunches. 10 min or less lunch menu. Soup bar ( all you can eat)  the broccoli cheddar soup is delicious. | Uno Pizzeria & Grill |
| Instead of pizza, come here for dessert. The deep dish sundae is really good. | Uno Pizzeria & Grill |
+------+------+
5 rows selected (3.272 seconds)

查询Twitter JSON数据

这是使用Drill查询包含一些Twitter数据的本地文件的示例。 如果您想尝试自己查询文件,可以在这里下载文件 。

首先,我切换到使用dfs存储插件:

0: jdbc:drill:zk=local> use dfs;
+-------+----------------------------------+
|  ok   |             summary              |
+-------+----------------------------------+
| true  | Default schema changed to [dfs]  |
+-------+----------------------------------+

然后尝试对文件进行选择。 注意limit 5子句–在您仅检查文件结构时非常有用。

0: jdbc:drill:zk=local> select * from `/user/oracle/incoming/twitter/geo_tweets.json` limit 5;
Error: DATA_READ ERROR: Error parsing JSON - Unexpected end-of-input within/between OBJECT entriesFile  /user/oracle/incoming/twitter/geo_tweets.json
Record  2819
Column  3503
Fragment 0:0

一个错误? 那不应该发生的。 我有一个JSON文件,对不对? 事实证明,JSON文件是每行一个完整的JSON对象。 除了不在最后一个记录上。 请注意上面的错误– 2819中给出的记录计数:

[oracle@bigdatalite ~]$ wc -l geo_tweets.json
2818 geo_tweets.json

因此,该文件只有2818条完整的行。 嗯 让我们使用头/尾巴组合来查看该记录:

[oracle@bigdatalite ~]$ head -n 2819 geo_tweets.json |tail -n1
{"created_at":"Sun Jul 24 21:00:44 +0000 2016","id":757319630432067584,"id_str":"757319630432067584","text":"And now @HillaryClinton hires @DWStweets: Honorary Campaign Manager across the USA #corruption #hillarysamerica  https://t.co/8jAGUu6w2f","source":"TweetCaster for iOS","truncated":false,"in_reply_to_status_id":null,"in_reply_to_status_id_str":null,"in_reply_to_user_id":null,"in_reply_to_user_id_str":null,"in_reply_to_screen_name":null,"user":{"id":2170786369,"id_str":"2170786369","name":"Patricia Weber","screen_name":"InnieBabyBoomer","location":"Williamsburg, VA","url":"http://lovesrantsandraves.blogspot.com/","description":"Baby Boomer, Swing Voter, Conservative, Spiritual, #Introvert, Wife, Grandma, Italian, ♥ Books, Cars, Ferrari, F1 Race♥  #tcot","protected":false,"verified":false,"followers_count":861,"friends_count":918,"listed_count":22,"favourites_count":17,"statuses_count":2363,"created_at":"Sat Nov 02 19:13:06 +0000 2013","utc_offset":null,"time_zone":null,"geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"C0DEED","profile_background_image_url":"http://pbs.twimg.com/profile_background_images/378800000107659131/3589f

这就是文件中的完整数据-Drill是正确的-JSON已损坏。 如果我们删除最后一条记录并创建一个新文件( geo_tweets.fixed.json

然后再次查询,我们得到了一些东西!

0: jdbc:drill:zk=local>  select text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------+
| text |
+------+
| Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| Donald Trump accepted the Republican nomination last night. Isis claimed responsibility. |
| Obama: "We must stand together and stop terrorism"
Trump: "We don't want these people in our country"|
| Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
+------+
5 rows selected (0.246 seconds)

此处的text是json字段之一。 我可以select *但不是很清楚:

0: jdbc:drill:zk=local>  select * from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| created_at | id | id_str | text | source | truncated | user | geo | coordinates | place | is_quote_status | retweet_count | favorite_count | entities | favorited | retweeted | possibly_sensitive | filter_level | lang | timestamp_ms | @version | @timestamp | user_name | user_screen_name | user_followers_count | user_friends_count | user_listed_count | user_favourites_count | user_statuses_count | user_created_at | place_name | place_country | hashtags_list | urls_array | urls_list | user_mentions_screen_name_list | longitude | latitude | hashtags_array | extended_entities | user_mentions_screen_name_array | in_reply_to_status_id | in_reply_to_status_id_str | in_reply_to_user_id | in_reply_to_user_id_str | in_reply_to_screen_name | retweeted_status | retweeted_screen_name | quoted_status_id | quoted_status_id_str | quoted_status |
+------------+----+--------+------+--------+-----------+------+-----+-------------+-------+-----------------+---------------+----------------+----------+-----------+-----------+--------------------+--------------+------+--------------+----------+------------+-----------+------------------+----------------------+--------------------+-------------------+-----------------------+---------------------+-----------------+------------+---------------+---------------+------------+-----------+--------------------------------+-----------+----------+----------------+-------------------+---------------------------------+-----------------------+---------------------------+---------------------+-------------------------+-------------------------+------------------+-----------------------+------------------+----------------------+---------------+
| Fri Jul 22 19:37:11 +0000 2016 | 756573827589545984 | 756573827589545984 | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim | dlvr.it | false | {"id":67898674,"id_str":"67898674","name":"Vancouver Press","screen_name":"Vancouver_CP","location":"Vancouver, BC","url":"http://vancouver.cityandpress.com/","description":"Latest news from Vancouver. Updates are frequent.","protected":false,"verified":false,"followers_count":807,"friends_count":13,"listed_count":94,"favourites_count":1,"statuses_count":131010,"created_at":"Sat Aug 22 14:25:37 +0000 2009","utc_offset":-25200,"time_zone":"Pacific Time (US & Canada)","geo_enabled":true,"lang":"en","contributors_enabled":false,"is_translator":false,"profile_background_color":"FFFFFF","profile_background_image_url":"http://abs.twimg.com/images/themes/theme1/bg.png","profile_background_image_url_https":"https://abs.twimg.com/images/themes/theme1/bg.png","profile_background_tile":false,"profile_link_color":"8A1C3B","profile_sidebar_border_color":"FFFFFF","profile_sidebar_fill_color":"FFFFFF","profile_text_color":"2A2C31","profile_use_background_image":false,"profile_image_url":"http://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_image_url_https":"https://pbs.twimg.com/profile_images/515841109553983490/_t0QWPco_normal.png","profile_banner_url":"https://pbs.twimg.com/profile_banners/67898674/1411821103","default_profile":false,"default_profile_image":false} | {"type":"Point","coordinates":[49.2814375,-123.12109067]} | {"type":"Point","coordinates":[-123.12109067,49.2814375]} | {"id":"1e5cb4d0509db554","url":"https://api.twitter.com/1.1/geo/id/1e5cb4d0509db554.json","place_type":"city","name":"Vancouver","full_name":"Vancouver, British Columbia","country_code":"CA","country":"Canada","bounding_box":{"type":"Polygon","coordinates":[[[-123.224215,49.19854],[-123.224215,49.316738],[-123.022947,49.316738],[-123.022947,49.19854]]]},"attributes":{}} | false | 0 | 0 | {"urls":[{"url":"https://t.co/joI9GMfRim","expanded_url":"http://toplocalnow.com/ca/vancouver?section=trends","display_url":"toplocalnow.com/ca/vancouver?s…","indices":[70,93]}],"hashtags":[],"user_mentions":[],"media":[],"symbols":[]} | false | false | false | low | en | 1469216231616 | 1 | 2016-07-22T19:37:11.000Z | Vancouver Press | Vancouver_CP | 807 | 13 | 94 | 1 | 131010 | Sat Aug 22 14:25:37 +0000 2009 | Vancouver | Canada |  | ["toplocalnow.com/ca/vancouver?s…"] | toplocalnow.com/ca/vancouver?s… |  | -123.12109067 | 49.2814375 | [] | {"media":[]} | [] | null | null | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"hashtags":[],"urls":[]},"extended_entities":{"media":[]},"quoted_status":{"user":{},"entities":{"hashtags":[],"user_mentions":[],"media":[],"urls":[]},"extended_entities":{"media":[]}}} | null | null | null | {"user":{},"entities":{"user_mentions":[],"media":[],"urls":[],"hashtags":[]},"extended_entities":{"media":[]},"place":{"bounding_box":{"coordinates":[]},"attributes":{}},"geo":{"coordinates":[]},"coordinates":{"coordinates":[]}} |

在twitter数据中,存在根级字段(例如text )和嵌套字段(例如user字段中有关高音扬声器的信息)。 如上所示,您使用点表示法引用了嵌套字段。 现在是指出您可能会遇到的几个常见错误的好时机。 首先是不引用保留字,而是检查是否Encountered "."诸如“ Encountered "."类的错误的第一件事Encountered "."

0: jdbc:drill:zk=local>  select user.screen_name,text from `/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Error: PARSE ERROR: Encountered "." at line 1, column 12.
[...]

其次是在使用点表示法时声明表别名–如果您不这样做,则Apache Drill会认为父列实际上是表名( VALIDATION ERROR: [...] Table 'user' not found ):

0: jdbc:drill:zk=local>  select `user`.screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` limit 5;
Aug 10, 2016 11:16:45 PM org.apache.calcite.sql.validate.SqlValidatorException 
SEVERE: org.apache.calcite.sql.validate.SqlValidatorException: Table 'user' not found
Aug 10, 2016 11:16:45 PM org.apache.calcite.runtime.CalciteException 
SEVERE: org.apache.calcite.runtime.CalciteContextException: From line 1, column 8 to line 1, column 13: Table 'user' not found
Error: VALIDATION ERROR: From line 1, column 8 to line 1, column 13: Table 'user' not foundSQL Query null[Error Id: 1427fd23-e180-40be-a751-b6f1f838233a on 192.168.56.1:31010] (state=,code=0)

修正了这些错误之后,我们可以看到用户的屏幕名称:

0: jdbc:drill:zk=local>  select tweets.`user`.`screen_name` as user_screen_name,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 2;
+------------------+------+
| user_screen_name | text |
+------------------+------+
| Vancouver_CP     | Vancouver trends now: Trump, Evander Kane, Munich, 2016HCC and dcc16. https://t.co/joI9GMfRim |
| tmj_TUC_skltrd   | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
+------------------+------+
2 rows selected (0.256 seconds)
0: jdbc:drill:zk=local>

除了嵌套对象,JSON还支持数组。 Twitter数据中的一个示例是#标签或URL,在给定的tweet中,两者都可以为零,一个或多个。

0: jdbc:drill:zk=local>  select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+--------+
| EXPR$0 |
+--------+
| [] |
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [] |
| [] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
+--------+
5 rows selected (0.286 seconds)

使用FLATTEN函数,每个数组条目都变成一个新行,因此:

0: jdbc:drill:zk=local>  select flatten(tweets.entities.hashtags) from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets limit 5;
+----------------------------------------------+
|                    EXPR$0                    |
+----------------------------------------------+
| {"text":"hiring","indices":[6,13]}           |
| {"text":"Job","indices":[98,102]}            |
| {"text":"SkilledTrade","indices":[103,116]}  |
| {"text":"Tucson","indices":[117,124]}        |
| {"text":"Jobs","indices":[129,134]}          |
+----------------------------------------------+
5 rows selected (0.139 seconds)

请注意, limit 5子句仅显示前五个数组实例,实际上这只是上述列表中第一个tweet的主题标签。

要访问主题标签的文本,我们使用子查询和点表示法来访问text字段:

0: jdbc:drill:zk=local> select ent_hashtags.hashtags.text from (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) as ent_hashtags limit 5;
+---------------+
|    EXPR$0     |
+---------------+
| hiring        |
| Job           |
| SkilledTrade  |
| Tucson        |
| Jobs          |
+---------------+
5 rows selected (0.168 seconds)

对于相同的结果,可以通过使用公用表表达式 (CTE,也称为子查询分解)来提高可读性:

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text from ent_hashtags
. . . . . . . . . . . > limit 5;
+---------------+
|    EXPR$0     |
+---------------+
| hiring        |
| Job           |
| SkilledTrade  |
| Tucson        |
| Jobs          |
+---------------+
5 rows selected (0.253 seconds)

将展平的数组与现有字段结合起来,使我们能够看到类似推文列表及其相关主题标签的内容:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp limit 10;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | hiring |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | SkilledTrade |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Tucson |
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Jobs |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | lol |
| johnmayberry | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw | nowthatsfunny |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | WinstonSalem |
| greensboro_nc | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 | ws |
| trendinaliaSG | 6. Hit The Stage
7. TTTT
8. Demi Lovato
9. Beijing
10. Donald Trump2016/7/23 03:36 SGT #trndnl https://t.co/psP0GzBgZB | trndnl |
+------------------+------+---------+
10 rows selected (0.166 seconds)

我们还可以根据主题标签进行过滤:

0: jdbc:drill:zk=local> with tmp as ( select flatten(tweets.entities.hashtags) as hashtags,tweets.text,tweets.`user`.screen_name as user_screen_name from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets) select tmp.user_screen_name,tmp.text,tmp.hashtags.text as hashtag from tmp where tmp.hashtags.text = 'Job' limit 5;
+------------------+------+---------+
| user_screen_name | text | hashtag |
+------------------+------+---------+
| tmj_TUC_skltrd | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs | Job |
| tmj_VAL_health | Want to work at Genesis Rehab Services? We're #hiring in #Clinton, MD! Click for details: https://t.co/4lt7I4gMZk #Job #Healthcare #Jobs | Job |
| tmj_in_retail | Want to work in #Clinton, IN? View our latest opening: https://t.co/UiimnlubYs #Job #Retail #Jobs #Hiring #CareerArc | Job |
| tmj_la_hrta | Want to work at SONIC Drive-In? We're #hiring in #Clinton, LA! Click for details: https://t.co/aQ1FrWc7iR #Job #SONIC #Hospitality #Jobs | Job |
| tmj_ia_hrta | We're #hiring! Click to apply: Department Manager - https://t.co/SnoKcwwHFk #Job #Hospitality #Clinton, IA #Jobs #CareerArc | Job |
+------------------+------+---------+
5 rows selected (0.207 seconds)

以及总结标签计数:

0: jdbc:drill:zk=local> with ent_hashtags as (select flatten(tweets.entities.hashtags) as hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets)
. . . . . . . . . . . > select ent_hashtags.hashtags.text,count(ent_hashtags.hashtags.text) from ent_hashtags
. . . . . . . . . . . > group by ent_hashtags.hashtags.text
. . . . . . . . . . . > order by 2 desc;
+-----------------------------+---------+
|           EXPR$0            | EXPR$1  |
+-----------------------------+---------+
| Trump                       | 365     |
| trndnl                      | 176     |
| job                         | 170     |
| Hiring                      | 127     |
| Clinton                     | 108     |
| Yorkshire                   | 100     |
| CareerArc                   | 100     |
[...]

要过滤掉可能没有数组值的记录(例如,不是每个推特都具有的哈希标签)并且没有查询的记录可能会失败,请对数组的第一个索引的属性使用IS NOT NULL

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+--------+
| EXPR$0 |
+--------+
| [{"text":"hiring","indices":[6,13]},{"text":"Job","indices":[98,102]},{"text":"SkilledTrade","indices":[103,116]},{"text":"Tucson","indices":[117,124]},{"text":"Jobs","indices":[129,134]}] |
| [{"text":"lol","indices":[72,76]},{"text":"nowthatsfunny","indices":[77,91]}] |
| [{"text":"WinstonSalem","indices":[0,13]},{"text":"ws","indices":[92,95]}] |
| [{"text":"trndnl","indices":[89,96]}] |
| [{"text":"trndnl","indices":[92,99]}] |
+--------+
5 rows selected (0.187 seconds)

如果您尝试比较数组本身,则无法使用:

0: jdbc:drill:zk=local> select tweets.entities.hashtags from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags is not null limit 5; Error: SYSTEM ERROR: SchemaChangeException: Failure while trying to materialize incoming schema. Errors: Error in expression at index -1. Error: Missing function implementation: [isnotnull(MAP-REPEATED)]. Full expression: --UNKNOWN EXPRESSION--.. Fragment 0:0 [Error Id: 99ac12aa-f6b4-4692-b815-8f483da682c4 on 192.168.56.1:31010] (state=,code=0)

上面的示例演示了如何使用数组索引,这是FLATTEN一种替代选择,用于访问数组中的单个对象(如果您知道它们将存在):

0: jdbc:drill:zk=local> select tweets.entities.hashtags[0].text as first_hashtag,text from dfs.`/users/rmoff/data/geo_tweets.fixed.json` tweets where tweets.entities.hashtags[0].text is not null limit 5;
+---------------+------+
| first_hashtag | text |
+---------------+------+
| hiring | We're #hiring! Click to apply: Bench Jeweler - SEC Oracle & Wetmore - https://t.co/Oe2SHaL0Hh #Job #SkilledTrade #Tucson, AZ #Jobs |
| lol | Someone built a wall around Trump's star on the Hollywood Walk of Fame. #lol #nowthatsfunny @… https://t.co/qHWuJXnzbw |
| WinstonSalem | #WinstonSalem Time and place announced for Donald Trump's visit to… https://t.co/6OVl7crshw #ws @winston_salem_ https://t.co/l5h220otj4 |

查询CSV文件

JSON文件相对易于解释,因为它们内部具有半定义的架构,包括列名。 另一方面,在可靠地推断列名称时,CSV(通常是用字符分隔的文件)更像是“狂野的西部”。 如果需要,可以配置Apache Drill忽略CSV文件的第一行(假设它是标题),也可以将它们用作列名。 如果您不这样做,则查询如下所示的CSV文件:

[oracle@bigdatalite ~]$ head nyc_parking_violations.csv
Summons Number,Plate ID,Registration State,Plate Type,Issue Date,Violation Code,Vehicle Body Type,Vehicle Make,Issuing Agency,Street Code1,Street Code2,Street Code3,Vehicle Expiration Date,Violation Location,Violation Precinct,Issuer Precinct,Issuer Code,Issuer Command,Issuer Squad,Violation Time,Time First Observed,Violation County,Violation In Front Of Or Opposite,House Number,Street Name,Intersecting Street,Date First Observed,Law Section,Sub Division,Violation Legal Code,Days Parking In Effect    ,From Hours In Effect,To Hours In Effect,Vehicle Color,Unregistered Vehicle?,Vehicle Year,Meter Number,Feet From Curb,Violation Post Code,Violation Description,No Standing or Stopping Violation,Hydrant Violation,Double Parking Violation
1360858775,PHW9801,OH,PAS,07/01/2015,20,SUBN,HONDA,P,61490,26160,26190,0,0044,44,44,929822,0044,0000,0653P,,BX,O,651,RIVER AVE,,0,408,D,,BBBBBBB,ALL,ALL,,0,0,-,0,,,,,

您将获得两个记录,每个记录为一列宽,它们是一个数组:

0: jdbc:drill:zk=local> select *  from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` LIMIT 5;
+---------+
| columns |
+---------+
| ["Summons Number","Plate ID","Registration State","Plate Type","Issue Date","Violation Code","Vehicle Body Type","Vehicle Make","Issuing Agency","Street Code1","Street Code2","Street Code3","Vehicle Expiration Date","Violation Location","Violation Precinct","Issuer Precinct","Issuer Code","Issuer Command","Issuer Squad","Violation Time","Time First Observed","Violation County","Violation In Front Of Or Opposite","House Number","Street Name","Intersecting Street","Date First Observed","Law Section","Sub Division","Violation Legal Code","Days Parking In Effect    ","From Hours In Effect","To Hours In Effect","Vehicle Color","Unregistered Vehicle?","Vehicle Year","Meter Number","Feet From Curb","Violation Post Code","Violation Description","No Standing or Stopping Violation","Hydrant Violation","Double Parking Violation"] |
| ["1360858775","PHW9801","OH","PAS","07/01/2015","20","SUBN","HONDA","P","61490","26160","26190","0","0044","44","44","929822","0044","0000","0653P","","BX","O","651","RIVER AVE","","0","408","D","","BBBBBBB","ALL","ALL","","0","0","-","0","","","","",""] |

要访问CSV文件中的实际列,您需要使用columns[x]语法来引用它们。 请注意,各columns区分大小写,并且编号从零开始:

0: jdbc:drill:zk=local> select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv` limit 5;
+----------+--------------------+
| PlateID  | RegistrationState  |
+----------+--------------------+
| AR877A   | NJ                 |
| 73268ME  | NY                 |
| 2050240  | IN                 |
| 2250017  | IN                 |
| AH524C   | NJ                 |
+----------+--------------------+
5 rows selected (0.247 seconds)

为了使重复处理数据更加容易,您可以定义数据视图:

0: jdbc:drill:zk=local> create view dfs.tmp.NYC_Parking_01 as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-------+-----------------------------------------------------------------+
|  ok   |                             summary                             |
+-------+-----------------------------------------------------------------+
| true  | View 'NYC_Parking_01' created successfully in 'dfs.tmp' schema  |
+-------+-----------------------------------------------------------------+
1 row selected (0.304 seconds)

这使用的是dfs存储插件及其内部的tmp模式,它具有以下存储配置–请注意, writeable为true

"tmp": {"location": "/tmp","writable": true,"defaultInputFormat": null
}

(如果您使用了错误的数据库[存储插件]或架构,您将获得Schema [hdfs] is immutable.

查询新视图

0: jdbc:drill:zk=local> select * from dfs.tmp.NYC_Parking_01 limit 5;
+-----------+---------------------+
|  PlateID  |  RegistrationState  |
+-----------+---------------------+
| Plate ID  | Registration State  |
| PHW9801   | OH                  |
| K8010F    | TN                  |
| GFG6211   | NY                  |
| GHL1805   | NY                  |
+-----------+---------------------+
5 rows selected (0.191 seconds)

通过视图或直接针对CSV路径,您还可以运行聚合:

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID  | EXPR$1  |
+----------+---------+
| 2050240  | 4       |
+----------+---------+
1 row selected (15.983 seconds)

尽管对于相同的结果,这不能重新运行-可能是由于limit子句

0: jdbc:drill:zk=local> select PlateID,count(*) from dfs.tmp.NYC_Parking_01 group by PlateID having count(*) > 1 limit 1;
+----------+---------+
| PlateID  | EXPR$1  |
+----------+---------+
| AR877A   | 3       |
+----------+---------+
1 row selected (12.881 seconds)

在幕后,视图定义被写入/tmp –如果要在重新启动后保留此数据,则需要移动此路径:

[oracle@bigdatalite parking]$ cat /tmp/NYC_Parking_01.view.drill
{"name" : "NYC_Parking_01","sql" : "SELECT `columns`[1] AS `PlateID`, `columns`[2] AS `RegistrationState`\nFROM `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`","fields" : [ {"name" : "PlateID","type" : "ANY","isNullable" : true}, {"name" : "RegistrationState","type" : "ANY","isNullable" : true} ],"workspaceSchemaPath" : [ "hdfs" ]

您还可以使用CTAS(选择时创建表)创建实际表:

0: jdbc:drill:zk=local> create table dfs.tmp.parking as select columns[1] as `PlateID`, columns[2] as `RegistrationState` from `/user/oracle/incoming/nyc_parking/nyc_parking_violations.csv`;
+-----------+----------------------------+
| Fragment  | Number of records written  |
+-----------+----------------------------+
| 1_1       | 4471875                    |
| 1_0       | 4788421                    |
+-----------+----------------------------+
2 rows selected (42.913 seconds)

它存储在磁盘上(按dfs配置),默认情况下以Parquet格式存储:

[oracle@bigdatalite parking]$ ls -l /tmp/parking/
total 76508
-rw-r--r--. 1 oracle oinstall 40623288 Aug 10 22:53 1_0_0.parquet
-rw-r--r--. 1 oracle oinstall 37717804 Aug 10 22:53 1_1_0.parquet

Drill的Web界面

Drill带有Web界面,您可以从http://访问

:8047 /对

  • 发出查询
    Introduction-apache-drill-1
    Introduction-apache-drill-2
  • 配置其他存储插件(例如, 数据库 , hdfs等)
    Introduction-apache-drill-3
  • 指标和调试
    Introduction-apache-drill-4
    Introduction-apache-drill-5

定义存储插件

在Drill Web界面中,您可以查看现有的存储插件或定义新的存储插件。 要创建一个新文件,请在“存储”页面上的“ 新存储插件”下输入其名称(例如hdfs ,但可能会引起fred ,它只是一个标签),然后单击“创建”。 在“配置”框中粘贴必要的JSON定义,然后单击“创建”。 如果您不想使用GUI,那么还有REST API。

存储插件配置存储在Zookeeper中(运行分布式Drill时),或者独立运行时在sys.store.provider.local.path路径中本地sys.store.provider.local.path 。 默认情况下,该/tmp位于/tmp下,该/tmp在服务器重新启动时被清除。 要保留自定义存储配置,请修改drill-override.confsys.store.provider.local.path ,例如:

drill.exec: {cluster-id: "drillbits1",zk.connect: "localhost:2181"sys.store.provider.local.path="/home/oracle/drill/"
}

处理文件系统数据

这是一个使Drill能够访问CDH群集的HDFS的存储配置示例:

{"type": "file","enabled": true,"connection": "hdfs://cdh57-01-node-01:8020/","config": null,"workspaces": {"root": {"location": "/","writable": true,"defaultInputFormat": null}},"formats": {"csv": {"type": "text","extensions": ["csv"],"delimiter": ","},"json": {"type": "json","extensions": ["json"]}}
}

除了HDFS的connection参数本身外,此配置中的重要部分是formats部分。 这就告诉Drill如何处理找到的文件,而最终用户不必显式声明其类型。

对于基于文件系统的插件dfs (可以包括本地文件,HDFS甚至Amazon S3),您可以浏览可用的“表”:

列出HDFS中的文件(以前use hdfs;选择)

0: jdbc:drill:zk=local> show files;
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
|  name  | isDirectory  | isFile  | length  | owner  |    group    | permissions  |       accessTime       |     modificationTime     |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
| hbase  | true         | false   | 0       | hbase  | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-25 14:46:08.212  |
| share  | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-05-15 12:28:08.152  |
| solr   | true         | false   | 0       | solr   | solr        | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:34:50.716  |
| tmp    | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwt    | 1969-12-31 19:00:00.0  | 2016-06-24 04:54:41.491  |
| user   | true         | false   | 0       | hdfs   | supergroup  | rwxrwxrwx    | 1969-12-31 19:00:00.0  | 2016-06-21 15:55:59.084  |
| var    | true         | false   | 0       | hdfs   | supergroup  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-11 17:53:29.804  |
+--------+--------------+---------+---------+--------+-------------+--------------+------------------------+--------------------------+
6 rows selected (0.145 seconds)

显示给定路径中的文件:

0: jdbc:drill:zk=local> show files in `/user/oracle`;
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
|      name      | isDirectory  | isFile  | length  |  owner  |  group  | permissions  |       accessTime       |     modificationTime     |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
| .Trash         | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-23 20:42:34.815  |
| .sparkStaging  | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-07-06 03:56:38.863  |
| .staging       | true         | false   | 0       | oracle  | oracle  | rwx------    | 1969-12-31 19:00:00.0  | 2016-06-01 18:37:04.005  |
| incoming       | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-08-03 05:34:12.38   |
| mediademo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-06-01 18:59:45.653  |
| moviedemo      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:02:55.652  |
| moviework      | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.497  |
| oggdemo        | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.552  |
| oozie-oozi     | true         | false   | 0       | oracle  | oracle  | rwxr-xr-x    | 1969-12-31 19:00:00.0  | 2016-05-15 12:03:01.651  |
+----------------+--------------+---------+---------+---------+---------+--------------+------------------------+--------------------------+
9 rows selected (0.428 seconds)

您还可以通过指定通配符匹配来查询多个文件。 以下是可用文件的截断列表:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> show files in `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/`;
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
|           name           | isDirectory  | isFile  |  length  | owner  | group  | permissions  |        accessTime        |     modificationTime     |
+--------------------------+--------------+---------+----------+--------+--------+--------------+--------------------------+--------------------------+
| FlumeData.1466176113171  | false        | true    | 1055675  | rmoff  | rmoff  | rw-r--r--    | 2016-08-10 21:28:27.072  | 2016-06-17 16:08:38.023  |
| FlumeData.1466176113172  | false        | true    | 1051411  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:40.597  |
| FlumeData.1466176113173  | false        | true    | 1054734  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:43.33   |
| FlumeData.1466176113174  | false        | true    | 1050991  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.743  | 2016-06-17 16:08:44.361  |
| FlumeData.1466176113175  | false        | true    | 1053577  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.748  | 2016-06-17 16:08:45.162  |
| FlumeData.1466176113176  | false        | true    | 1051965  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.752  | 2016-06-17 16:08:46.261  |
| FlumeData.1466176113177  | false        | true    | 1049555  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:47.425  |
| FlumeData.1466176113178  | false        | true    | 1050566  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:48.23   |
| FlumeData.1466176113179  | false        | true    | 1051751  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.756  | 2016-06-17 16:08:49.381  |
| FlumeData.1466176113180  | false        | true    | 1052249  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:50.042  |
| FlumeData.1466176113181  | false        | true    | 1055002  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:50.896  |
| FlumeData.1466176113182  | false        | true    | 1050812  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.758  | 2016-06-17 16:08:52.191  |
| FlumeData.1466176113183  | false        | true    | 1048954  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.757  | 2016-06-17 16:08:52.994  |
| FlumeData.1466176113184  | false        | true    | 1051559  | rmoff  | rmoff  | rw-r--r--    | 2016-08-05 20:46:51.773  | 2016-06-17 16:08:54.025  |
[...]

计算一个文件( FlumeData.1466176113171 )中的记录数:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 277     |
+---------+
1 row selected (0.798 seconds)

在多个文件( FlumeData.146617611317* )中:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.146617611317*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 2415    |
+---------+
1 row selected (2.466 seconds)

在文件夹( * )中的所有文件中:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 7414    |
+---------+
1 row selected (3.867 seconds)

甚至跨多个文件夹:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT count(*) FROM table(`hdfs`.`/user/flume/incoming/twitter/2016/06/*/*`(type => 'json'));
+---------+
| EXPR$0  |
+---------+
| 206793  |
+---------+
1 row selected (87.545 seconds)

查询不带识别扩展名的数据

Drill依赖于orer中存储扩展配置的format子句来确定如何根据文件的扩展名解释文件。 您将不会总是拥有可用的或已定义的扩展。 如果尝试查询此类数据,您将走不远。 在此示例中,我在JSON格式但没有.json后缀的HDFS上查询数据:

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM `hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171` limit 5;
Error: VALIDATION ERROR: From line 1, column 18 to line 1, column 23: Table 'hdfs./user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171' not foundSQL Query null

不用担心–您可以将它们声明为查询语法的一部分。

0: jdbc:drill:zk=cdh57-01-node-01.moffatt.me:> SELECT text FROM table(`hdfs`.`/user/rmoff/incoming/twitter/2016/06/17/tweets/FlumeData.1466176113171`(type => 'json')) limit 5;
+------+
| text |
+------+
| RT @jjkukrl: susu bayi jg lagi mahal nih ugh ayah harus semangat cari duit ^^9 https://t.co/2NvTOShRbI |
| Oracle Java 1Z0-808 Web Exam Simulator https://t.co/tZ3gU8EMj3 |
| @TribuneSelatan ahaha kudu gaya atuh da arek lebarann ahahaha |
| Short impression of yesterday's speech. What a great day it was! #lifeatoracle #team #salesincentive #oracle https://t.co/SVK2ovOe3U |
| Want to work at Oracle? We're #hiring in New York! Click for details: https://t.co/NMTo1WMHVw #Sales #Job #Jobs #CareerArc |
+------+
5 rows selected (1.267 seconds)

存储配置– Oracle

根据文档,可以很容易地查询RDBMS(例如Oracle)中的数据。 只需将JDBC驱动程序复制到Apache Drill的jar文件夹中:

cp /u01/app/oracle/product/12.1.0.2/dbhome_1/jdbc/lib/ojdbc7.jar /opt/apache-drill-1.7.0/jars/3rdparty/

然后添加必要的存储配置,我称之为ora:

{"type": "jdbc","driver": "oracle.jdbc.OracleDriver","url": "jdbc:oracle:thin:moviedemo/welcome1@localhost:1521/ORCL","username": null,"password": null,"enabled": true
}

如果遇到错误, Please retry: error (unable to create/ update storage)然后检查目标Oracle数据库是否已启动,密码是否正确,等等。

然后,您可以在Hive中查询数据:

0: jdbc:drill:zk=local> use ora.MOVIEDEMO;
+-------+--------------------------------------------+
|  ok   |                  summary                   |
+-------+--------------------------------------------+
| true  | Default schema changed to [ora.MOVIEDEMO]  |
+-------+--------------------------------------------+
1 row selected (0.205 seconds)0: jdbc:drill:zk=local> show tables;
+----------------+-----------------------------+
|  TABLE_SCHEMA  |         TABLE_NAME          |
+----------------+-----------------------------+
| ora.MOVIEDEMO  | ACTIVITY                    |
| ora.MOVIEDEMO  | BDS_CUSTOMER_RFM            |
| ora.MOVIEDEMO  | BUSINESS_REVIEW_SUMMARY     |
[...]0: jdbc:drill:zk=local> select * from ACTIVITY limit 5;
+--------------+---------+
| ACTIVITY_ID  |  NAME   |
+--------------+---------+
| 3.0          | Pause   |
| 6.0          | List    |
| 7.0          | Search  |
| 8.0          | Login   |
| 9.0          | Logout  |
+--------------+---------+
5 rows selected (1.644 seconds)

如果出现错误:DATA_READ错误:尝试设置SQL查询时JDBC存储插件失败。 然后在Apache Drill中启用详细错误以查看问题所在:

0: jdbc:drill:zk=local> ALTER SESSION SET `exec.errors.verbose` = true;
+-------+-------------------------------+
|  ok   |            summary            |
+-------+-------------------------------+
| true  | exec.errors.verbose updated.  |
+-------+-------------------------------+
1 row selected (0.154 seconds)0: jdbc:drill:zk=local> select * from ora.MOVIEDEMO.YELP_BUSINESS limit 1;
Error: DATA_READ ERROR: The JDBC storage plugin failed while trying setup the SQL query.sql SELECT *
FROM "MOVIEDEMO"."YELP_BUSINESS"
plugin ora
Fragment 0:0[Error Id: 40343dd5-1354-48ed-90ef-77ae1390411b on bigdatalite.localdomain:31010](java.sql.SQLException) ORA-29913: error in executing ODCIEXTTABLEOPEN callout
ORA-29400: data cartridge error
KUP-11504: error from external driver: MetaException(message:Could not connect to meta store using any of the URIs provided. Most recent failure: org.apache.thrift.transport.TTransportException: java.net.ConnectException: Connection refused

这是Oracle正在查询的外部表的问题( ORA-29913: error in executing ODCIEXTTABLEOPEN )。 它实际上是Hive表上的Oracle外部表,显然Drill可以直接查询-但是,我们只是在这里进行沙箱测试…

查询执行

正如Oracle具有基于成本的优化程序(CBO)来帮助确定如何执行查询以及最有效地执行查询一样,Apache Drill的执行引擎也可以确定如何实际执行您提供的查询。 这还包括如何将其拆分为多个节点(“钻头”)(如果可用),以及优化(例如在某些情况下进行分区修剪) 。 您可以在此处详细了解查询执行的工作方式 ,并在此处观看有关它的详细说明 。 要查看查询的解释计划,请使用explain plan

0: jdbc:drill:zk=local> !set maxwidth 10000  
0: jdbc:drill:zk=local> explain plan for select `date`,count(*) as tip_count from `/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json` group by `date` order by 2 desc limit 5;  
+------+------+
| text | json |
+------+------+
| 00-00    Screen
00-01      Project(date=[$0], tip_count=[$1])  
00-02        SelectionVectorRemover  
00-03          Limit(fetch=[5])  
00-04            SelectionVectorRemover  
00-05              TopN(limit=[5])  
00-06                HashAgg(group=[{0}], tip_count=[$SUM0($1)])  
00-07                  HashAgg(group=[{0}], tip_count=[COUNT()])  
00-08                    Scan(groupscan=[EasyGroupScan [selectionRoot=hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json, numFiles=1, columns=[`date`], files=[hdfs://localhost:8020/user/oracle/incoming/yelp/tip_json/yelp_academic_dataset_tip.json]]])  | {"head" : {"version" : 1,"generator" : {"type" : "ExplainHandler","info" : ""[...]

您还可以使用Drill Web界面查看有关查询执行方式的信息:

Introduction-apache-drill-6

Introduction-apache-drill-7

钻探资源管理器

MapR Drill ODBC驱动程序附带一个名为Drill Explorer的工具。 这是一个GUI,使您可以通过导航数据库(==存储插件)和其中的文件夹/文件,预览数据甚至创建视图来浏览数据。

Introduction-apache-drill-8

钻探客户

在Drill客户端中,有多种可用设置:

0: jdbc:drill:zk=local> !set  
autocommit          true  
autosave            false  
color               true  
fastconnect         true  
force               false  
headerinterval      100  
historyfile         /home/oracle/.sqlline/history  
incremental         true  
isolation           TRANSACTION_REPEATABLE_READ  
maxcolumnwidth      15  
maxheight           56  
maxwidth            1000000  
numberformat        default  
outputformat        table  
propertiesfile      /home/oracle/.sqlline/sqlline.properties  
rowlimit            0  
showelapsedtime     true  
showheader          true  
shownestederrs      false  
showwarnings        true  
silent              false  
timeout             -1  
trimscripts         true  
verbose             false

要更改一个,例如显示的输出宽度:

0: jdbc:drill:zk=local> !set maxwidth 10000

要连接到远程Drill,请指定存储Drillbit连接信息的Zookeeper节点:

rmoff@asgard-3:apache-drill-1.7.0> bin/sqlline -u jdbc:drill:zk=cdh57-01-node-01.moffatt.me:2181,cdh57-01-node-02.moffatt.me:2181,cdh57-01-node-03.moffatt.me:2181

结论

Apache Drill是一个功能强大的工具,用于针对不同的数据源使用熟悉的查询语言(SQL)。 在小范围内,仅能够通过JSON之类的结构化文件进行切片和切块是一个巨大的胜利。 在更大范围内,尝试跨机器集群查询更大数量的数据时,也许与诸如Impala之类的工具进行比较时,尝试Apache Drill进行比较是很有意思的。

有关Apache Drill的更多信息,请参见如何从OBIEE中访问Drill。

翻译自: https://www.javacodegeeks.com/2016/09/guide-getting-started-apache-drill.html

drill apache

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/335602.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

mac mysql 重设密码_Mac下忘记mysql密码重新设置密码的图文教程

MySQL 文件在路径/usr/local/mysql下1&#xff0c; 在系统偏好设置中关闭 mysql &#xff1a; Stop MySQL Server2 &#xff0c;打开终端进入路径  /usr/local/mysql/bin输入命令 sudo su&#xff0c; 然后输入开机密码。然后输入命令&#xff1a;./mysqld_safe –skip-grant-…

activiti 变量_如何在Activiti中使用瞬态变量

activiti 变量我们昨天发布的Activiti v6 Beta3中已经加入了很多需要的功能-临时变量。 在这篇文章中&#xff0c;我将向您展示一个示例&#xff0c;说明如何使用瞬态变量来覆盖一些以前不可能&#xff08;或最佳&#xff09;的高级用例。 到目前为止&#xff0c;Activiti中的…

erosa mysql_MySQL协议和canal实现

前言前面的文章里&#xff0c;我们了解到 canal 可以从 MySQL 中感知数据的变化。这是因为它模拟 MySQL slave 的交互协议&#xff0c;伪装自己为 MySQL slave &#xff0c;从而实现了主从复制。正是了解到这一点&#xff0c;笔者有两个问题便一直萦绕于心&#xff1a;它是如何…

HH SaaS电商系统的供应商系统设计

供应商信息结构图 供应商类型 商城的供应商划分为专享型、共享型两种&#xff0c;但是租户和店铺供应商则都是“专享型”的。 共享型供应商发布的商品归属供应商自己的&#xff0c;商品档案供应商才有资格管理&#xff0c;所以spu_base需要保存供应商id&#xff0c;有供应商id…

aws faas_带有AWS Lambda和Java的无服务器FaaS

aws faas什么是无服务器架构&#xff1f; 无服务器架构在由第三方完全管理的临时容器中运行自定义代码。 自定义代码通常只是完整应用程序的一小部分。 也称为函数 。 这为无服务器架构提供了另一个名称&#xff0c;即功能即服务 &#xff08;FaaS&#xff09;。 该容器是短暂的…

跨境商品的进口税额显示

跨境商品的采购类型有三种&#xff1a;直邮、保税、一般贸易&#xff0c;而一般贸易的商品已经清关入境了&#xff0c;虽然是跨境商品&#xff0c;但是无需再清关&#xff0c;所以商品详情页无需显示进口税相关信息。 直邮跨境商品显示的进口税信息如下图所示&#xff1a; 保税…

coreldraw x8段落_CDR X8设置自定义文字为默认字体(二)

通过上一篇文章的介绍&#xff0c;我们已经了解到了在CorelDRAW中如何自定义设置默认字体&#xff0c;相关阅读可参阅&#xff1a;CDR X8设置文字为默认字体。其实在CorelDRAW软件中给用户提供方式不止是一种&#xff0c;本文将介绍更多关于设置默认字体的方法。1. 打开CorelDR…

jhipster_JHipster入门,第3部分

jhipster欢迎回到本JHipster教程系列&#xff01; 在第一部分中&#xff0c;我们介绍了如何创建整体应用程序。 在第二部分中 &#xff0c;我们逐步创建了一个微服务应用程序&#xff08;这有点复杂&#xff09;。 对于那些正在努力使JHipster正常运转的人&#xff0c;我想着重…

jhipster_JHipster入门,第2部分

jhipster所以你回来了&#xff01; 在本系列的最后一部分中 &#xff0c;我们采用了单片路线创建了一个JHipster应用程序。 这是红色药丸路线&#xff1b; 生活几乎与您习惯的一样。 但是也许您喜欢挑战。 也许您想超越红色药丸并尝试蓝色药丸。 在这种情况下&#xff0c;Blue…

HH SaaS电商系统的虚拟资金账户(钱包余额)设计

文章目录方案一&#xff0c;将资金账户抽象出来虚拟资金账户余额流水记录实体方案二&#xff0c;用户表直接保存资金余额余额流水记录实体方案一&#xff0c;将资金账户抽象出来 虚拟资金账户 P.S. 如果机构代码和消费代码有区分类型&#xff0c;那么资金账户表中就不必保存“…

jhipster_jHipster入门,第1部分

jhipster因此&#xff0c;您想保持技术的领先地位&#xff0c;但对所有活动部件感到不知所措。 你真幸运&#xff01; 这就是jHipster发光的地方。 如果您喜欢Ruby on Rails或Grails的方法来快速启动和运行应用程序&#xff0c;那么这可能是适合您的选择。 jHipster旨在使设置…

java基准测试_微基准测试进入Java 9

java基准测试我已经几个月没有在这里写文章了&#xff0c;这种例外还会继续。 我计划在明年三月左右恢复写作。 本文末尾的说明。 等待&#xff01; 不完全是最后&#xff0c;因为您可以向下滚动。 它在文章结尾处。 继续阅读&#xff01; 三年前&#xff0c;我在写有关Java编…

python周天为一周的开始_从两个月减肥四十斤说起,不懈坚持,才是一个人最核心的竞争力。...

大学毕业了十年&#xff0c;我的身材失控了十年。我对我的胖身材已经习以为常了十年了&#xff0c;我从来没想过改变&#xff0c;因为我也没有毅力改变。有一句话说的&#xff0c;人越胖&#xff0c;就越懒。因为我胖&#xff0c;所以我不跑步&#xff0c;我也不爱外出。只要你…

java实现线程的方式_java多线程实现的四种方式

java多线程实现的四种方式1、继承Thread类&#xff0c;重写run方法(其实Thread类本身也实现了Runnable接口)2、实现Runnable接口&#xff0c;重写run方法3、实现Callable接口&#xff0c;重写call方法(有返回值)4、使用线程池(有返回值)1、继承Thread类&#xff0c;重写run方法…

本地运行flowable_在CockroachDB上运行Flowable

本地运行flowable什么是CockroachDB&#xff1f; CockroachDB是一个我一直关注了很长时间的项目。 这是一个开放源代码的Apache 2许可数据库&#xff08; Github链接 &#xff09;&#xff0c;该数据库在很大程度上汲取了Google Spanner白皮书的启发 。 它的核心是可水平扩展的…

win10禁用驱动程序强制签名_图文细说 win10系统未检测到第三个监视器的途径 -win10使用教程...

据统计调查&#xff0c;好多朋友都在为图文细说 win10系统未检测到第三个监视器的途径 的问题而烦恼&#xff0c;小编决定将解决的办法分享给大家。对于电脑菜鸟而言看到图文细说 win10系统未检测到第三个监视器的途径 是很无奈的&#xff0c;试着独自一个人解决图文细说 win10…

安装openstack_午餐前如何安装OpenStack Cloud

安装openstack图1. QuickStart的内部工作原理 云安装程序 如果我告诉您可以在必须停下来吃午餐之前进行OpenStack Cloud环境设置&#xff0c;该怎么办&#xff1f; 您会感到惊讶吗&#xff1f; 你今天可以做吗&#xff1f; 在大多数情况下&#xff0c;我敢打赌您的答案是不…

5e怎么绑定一键跳投_怎么开通淘宝亲情账号 淘宝亲情账号申请步骤【详解】

据官方消息&#xff0c;淘宝将在2月1日上线“亲情账号”功能。那么 淘宝亲情账号怎么开通?在哪申请? 下面为大家介绍下&#xff0c;一起来看看。亲情账号用户在淘宝账号中可以创建“父亲”或“母亲”角色关系&#xff0c;对父母发出注册邀请或直接绑定他们的账号。完成亲情关…

windows功能_你的Windows杀毒软件有这个功能吗?

安全软件首推-火绒&#xff0c;良心之作。比起360安全卫士、腾讯电脑管家&#xff0c;它无广告&#xff0c;无捆绑&#xff0c;无劫持&#xff0c;无弹窗&#xff0c;还免费。我认为最喜人最贴心的功能属它的‘弹窗拦截’了&#xff0c;可以拦截程序推送烦人的弹窗。五大浓缩亮…