Using MySQL 5.7 Generated Columns to Increase Query Performance

Translated by the 星耀队@知数堂 team (members: 星耀队-芬达, 星耀队-顺子, 星耀队-M哥)

Original post: https://www.percona.com/blog/2018/01/29/using-generated-columns-in-mysql-5-7-to-increase-query-performance/

Original author: Alexander Rubin

In this blog post, we’ll look at ways you can use MySQL 5.7 generated columns (or virtual columns) to improve query performance.

Overview

About two years ago I published a blog post about Generated (Virtual) Columns in MySQL 5.7. Since then, it’s been one of my favorite features in the MySQL 5.7 release. The reason is simple: with the help of virtual columns, we can create fine-grained indexes that can significantly increase query performance. I’m going to show you some tricks that can potentially fix slow reporting queries with GROUP BY and ORDER BY.
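For readers who haven't used the feature before, here is a minimal, self-contained sketch of a generated column with an index on it (my own illustration with a hypothetical table t, not from the original post):

CREATE TABLE t (
  a INT,
  b INT,
  -- c is computed from a and b; VIRTUAL means it is never stored in the row
  c INT GENERATED ALWAYS AS (a + b) VIRTUAL,
  -- the index on c, however, IS materialized, so the optimizer can use it
  KEY idx_c (c)
);

-- The optimizer can use idx_c even when the query repeats the expression:
-- SELECT * FROM t WHERE a + b = 10;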

The problem

Recently I was working with a customer who was struggling with this query:

SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       AVG(ExecutionTime) as 'Avg. Execution Time',
       COUNT(distinct AccountId) as 'No. Of Accounts',
       COUNT(distinct ParentAccountId) as 'No. Of Parents'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
HAVING COUNT(*) >= 1;

The query was running for more than an hour and used all space in the tmp directory (with sort files).
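(A side note, not from the original post.) If you suspect a query is spilling sorts to disk, two quick standard checks are the tmpdir location and the counter of on-disk merge passes:

SHOW VARIABLES LIKE 'tmpdir';                -- where MySQL writes temporary/sort files
SHOW GLOBAL STATUS LIKE 'Sort_merge_passes'; -- grows when sorts no longer fit in sort_buffer_size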

The table looked like this:

CREATE TABLE `ApiLog` (
  `Id` int(11) NOT NULL AUTO_INCREMENT,
  `ts` timestamp DEFAULT CURRENT_TIMESTAMP,
  `ServerName` varchar(50) NOT NULL default '',
  `ServerIP` varchar(50) NOT NULL default '',
  `ClientIP` varchar(50) NOT NULL default '',
  `ExecutionTime` int(11) NOT NULL default 0,
  `URL` varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL,
  `Verb` varchar(16) NOT NULL,
  `AccountId` int(11) NOT NULL,
  `ParentAccountId` int(11) NOT NULL,
  `QueryString` varchar(3000) NOT NULL,
  `Request` text NOT NULL,
  `RequestHeaders` varchar(2000) NOT NULL,
  `Response` text NOT NULL,
  `ResponseHeaders` varchar(2000) NOT NULL,
  `ResponseCode` varchar(4000) NOT NULL,
  ... // other fields removed for simplicity
  PRIMARY KEY (`Id`),
  KEY `index_timestamp` (`ts`),
  ... // other indexes removed for simplicity
) ENGINE=InnoDB;

We found out the query was not using an index on the timestamp field (“ts”):

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       avg(ExecutionTime) as 'Avg. Execution Time',
       count(distinct AccountId) as 'No. Of Accounts',
       count(distinct ParentAccountId) as 'No. Of Parents'
       FROM ApiLog
       WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
       GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))
       HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: ALL
possible_keys: ts
          key: NULL
      key_len: NULL
          ref: NULL
         rows: 22255292
     filtered: 50.00
        Extra: Using where; Using filesort
1 row in set, 1 warning (0.00 sec)

The reason for that is simple: the number of rows matching the filter condition was too large for an index scan to be efficient (or at least the optimizer thinks that):

mysql> select count(*) from ApiLog WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59';
+----------+
| count(*) |
+----------+
|  7948800 |
+----------+
1 row in set (2.68 sec)

Total number of rows: 21998514. The query needs to scan 36% of the total rows (7948800 / 21998514). (Translator's note: when the estimated number of rows to scan exceeds roughly 20-30% of the table, the optimizer usually switches to a full table scan even when an index is available.)

In this case, we have a number of approaches:

Create a combined index on timestamp column + group by fields

Create a covered index (including fields that are selected)

Create an index on just GROUP BY fields

Create an index for loose index scan

However, if we look closer at the “GROUP BY” part of the query, we quickly realize that none of those solutions will work. Here is our GROUP BY part:

GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))

There are two problems here:

It is using a calculated field, so MySQL can't just scan an index on verb + url. It needs to first concatenate the two fields and then group on the concatenated string. That means the index won't be used.

The URL is declared as "varchar(3000) COLLATE utf8mb4_unicode_ci NOT NULL" and can't be indexed in full (even with the innodb_large_prefix=1 option, which is the default in MySQL 5.7). We can only do a partial index, which won't be helpful for GROUP BY optimization.

Here, I’m trying to add a full index on the URL with innodb_large_prefix=1:

mysql> alter table ApiLog add key verb_url(verb, url);
ERROR 1071 (42000): Specified key was too long; max key length is 3072 bytes
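For completeness (my addition): the server would accept a partial (prefix) index, such as the hypothetical DDL below, but as noted above, a prefix index cannot drive this GROUP BY:

-- 255 characters * 4 bytes (utf8mb4) plus the verb column stays under the 3072-byte key limit
ALTER TABLE ApiLog ADD KEY verb_url_prefix (verb, url(255));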

Well, changing the "GROUP BY CONCAT(verb, ' - ', replace(url,'.xml',''))" to "GROUP BY verb, url" could help (assuming that we somehow trim the field definition from varchar(3000) to something smaller, which may or may not be possible). However, it would change the results, as it would not remove the .xml extension from the URL field.

The solution

The good news is that in MySQL 5.7 we have generated (virtual) columns. So we can create a virtual column on top of "CONCAT(verb, ' - ', replace(url,'.xml',''))". The best part: we do not have to perform the GROUP BY on the full string (potentially longer than 3000 bytes). We can use an MD5 hash (or a longer hash, e.g., SHA1/SHA2) for the purposes of the GROUP BY.

Here is the solution:

alter table ApiLog add verb_url_hash varbinary(16)
    GENERATED ALWAYS AS (unhex(md5(CONCAT(verb, ' - ', replace(url,'.xml',''))))) VIRTUAL;
alter table ApiLog add key (verb_url_hash);

So what we did here is:

Declared a virtual column with type varbinary(16)

Created the virtual column on CONCAT(verb, ' - ', replace(url,'.xml','')), applying an MD5 hash on top plus UNHEX to convert the 32 hex characters into 16 binary bytes

Created an index on top of the virtual column
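Since two different strings could in principle hash to the same MD5 value, it is worth verifying that grouping by the hash produces the same groups as the original expression. A quick sanity check (my addition, not in the original post): if the two counts below match, every group maps to exactly one original string.

SELECT COUNT(DISTINCT verb_url_hash) AS distinct_hashes,
       COUNT(DISTINCT CONCAT(verb, ' - ', replace(url,'.xml',''))) AS distinct_strings
FROM ApiLog;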

Now we can change the query to GROUP BY the verb_url_hash column:

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       avg(ExecutionTime) as 'Avg. Execution Time',
       count(distinct AccountId) as 'No. Of Accounts',
       count(distinct ParentAccountId) as 'No. Of Parents'
       FROM ApiLog
       WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
       GROUP BY verb_url_hash
       HAVING COUNT(*) >= 1;
ERROR 1055 (42000): Expression #1 of SELECT list is not in GROUP BY clause and contains
nonaggregated column 'ApiLog.ApiLog.Verb' which is not functionally dependent on columns
in GROUP BY clause; this is incompatible with sql_mode=only_full_group_by

MySQL 5.7 has a strict mode enabled by default, which we can change for that query only.

Now the explain plan looks much better:

mysql> select @@sql_mode;
+-------------------------------------------------------------------------------------------------------------------------------------------+
| @@sql_mode                                                                                                                                  |
+-------------------------------------------------------------------------------------------------------------------------------------------+
| ONLY_FULL_GROUP_BY,STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION |
+-------------------------------------------------------------------------------------------------------------------------------------------+
1 row in set (0.00 sec)

mysql> set sql_mode='STRICT_TRANS_TABLES,NO_ZERO_IN_DATE,NO_ZERO_DATE,ERROR_FOR_DIVISION_BY_ZERO,NO_AUTO_CREATE_USER,NO_ENGINE_SUBSTITUTION';
Query OK, 0 rows affected (0.00 sec)

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       avg(ExecutionTime) as 'Avg. Execution Time',
       count(distinct AccountId) as 'No. Of Accounts',
       count(distinct ParentAccountId) as 'No. Of Parents'
       FROM ApiLog
       WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
       GROUP BY verb_url_hash
       HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash
          key: verb_url_hash
      key_len: 19
          ref: NULL
         rows: 22008891
     filtered: 50.00
        Extra: Using where
1 row in set, 1 warning (0.00 sec)

MySQL will avoid any sorting, which is much faster. It will still have to eventually scan the whole table in the order of the index. The response time is significantly better: ~38 seconds as opposed to more than an hour.
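As an aside (my addition): if you prefer not to relax sql_mode at all, MySQL 5.7 also provides ANY_VALUE(), which suppresses the only_full_group_by check for a specific column. Since all rows sharing a verb_url_hash value also share the same verb and url (barring hash collisions), this is safe here:

SELECT CONCAT(ANY_VALUE(verb), ' - ', replace(ANY_VALUE(url),'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls'
FROM ApiLog
WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
GROUP BY verb_url_hash;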

Covered index

Now we can attempt to do a covered index, which will be quite large:

mysql> alter table ApiLog add key covered_index (`verb_url_hash`,`ts`,`ExecutionTime`,`AccountId`,`ParentAccountId`, verb, url);
Query OK, 0 rows affected (1 min 29.71 sec)
Records: 0  Duplicates: 0  Warnings: 0

We had to add "verb" and "url" to the index, so beforehand I had to remove COLLATE utf8mb4_unicode_ci from the table definition. Now EXPLAIN shows that we're using the index:

mysql> explain SELECT CONCAT(verb, ' - ', replace(url,'.xml','')) AS 'API Call',
       COUNT(*) as 'No. of API Calls',
       AVG(ExecutionTime) as 'Avg. Execution Time',
       COUNT(distinct AccountId) as 'No. Of Accounts',
       COUNT(distinct ParentAccountId) as 'No. Of Parents'
       FROM ApiLog
       WHERE ts between '2017-10-01 00:00:00' and '2017-12-31 23:59:59'
       GROUP BY verb_url_hash
       HAVING COUNT(*) >= 1\G
*************************** 1. row ***************************
           id: 1
  select_type: SIMPLE
        table: ApiLog
   partitions: NULL
         type: index
possible_keys: ts,verb_url_hash,covered_index
          key: covered_index
      key_len: 3057
          ref: NULL
         rows: 22382136
     filtered: 50.00
        Extra: Using where; Using index
1 row in set, 1 warning (0.00 sec)

The response time dropped to ~12 seconds! However, the index size is significantly larger compared to just verb_url_hash (16 bytes per record).
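To put a number on that trade-off, you can compare on-disk index sizes from InnoDB's persistent statistics tables (a sketch; the schema name below is a placeholder):

SELECT index_name,
       ROUND(stat_value * @@innodb_page_size / 1024 / 1024) AS size_mb
FROM mysql.innodb_index_stats
WHERE database_name = 'mydb'   -- placeholder: your schema name
  AND table_name = 'ApiLog'
  AND stat_name = 'size';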

Conclusion

MySQL 5.7 generated columns provide a valuable way to improve query performance. If you have an interesting case, please share in the comments.
