Horizontally Scaling the Hive Metastore Database by Migrating from MySQL to TiDB
Industry: Knowledge Sharing
Author: Mengyu Hu (Platform Engineer at Zhihu)
Zhihu, which means “Do you know?” in classical Chinese, is the Quora of China: a question-and-answer website where all kinds of questions are created, answered, edited, and organized by its community of users. As China’s biggest knowledge sharing platform, we have 220 million registered users and 30 million questions on the site, with more than 130 million answers. In August 2019, we completed $450 million in F-round funding.
At Zhihu, we used MySQL as the backend database for Hive Metastore. As data grew in Hive, MySQL stored about 60 GB of data, and the largest table had more than 20 million rows. Although that data volume was not excessive for a standalone MySQL database, running queries or writing data in Hive caused frequent operations in Metastore. In this situation, MySQL, Metastore’s backend database, became the bottleneck for the entire system. We compared multiple solutions and found that TiDB, an open-source, distributed Hybrid Transactional/Analytical Processing (HTAP) database, was the optimal one. Thanks to TiDB’s elastic scalability, we can horizontally scale our metadata storage system without worrying about database capacity.
Last year, we published a post that showed how we kept our query response times at millisecond levels despite having over 1.3 trillion rows of data. That post became a hit on various media platforms like Hacker News and DZone. Today, I’ll share with you how we use TiDB to horizontally scale Hive Metastore to meet our growing business needs.
Our pain point
Apache Hive is a data warehouse software project built on top of Apache Hadoop that provides data query and analysis. Hive Metastore is Hive’s metadata management tool. It provides a series of interfaces for operating on metadata, and its backend storage generally uses a relational database such as Derby or MySQL. Besides Hive itself, many computing frameworks, such as Presto, Spark, and Flink, support using Hive Metastore as a metadata center to query data in the underlying Hadoop ecosystem.
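To make this concrete, here is the kind of Hive statement that touches only metadata; the database and table names below are illustrative, not from our production schema:

```sql
-- These statements read metadata only, so each one turns into SQL queries
-- against Metastore's backend database rather than a scan of the data.
SHOW PARTITIONS example_db.example_events;
DESCRIBE FORMATTED example_db.example_events;
```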
At Zhihu, we used MySQL as the Hive Metastore backend. As data grew in Hive, a single table stored more than 20 million rows in MySQL. When a user’s task performed intensive operations in Metastore, it often ran slowly or even timed out. This greatly affected task stability. If this continued, MySQL would be overwhelmed. Therefore, it was critical to optimize Hive Metastore.
To reduce MySQL’s data size and ease the pressure on Metastore, we regularly deleted metadata in MySQL (a sketch of this kind of cleanup follows the list below). However, in practice this policy had the following drawbacks:
- Data grew much faster than it was deleted.
- When we deleted partitions of a very large partitioned table that had millions of partitions, it put pressure on MySQL. We had to control the concurrency of such queries, and at peak hours only one could be executed at a time; otherwise, this would affect other operations in Metastore, such as `SELECT` and `UPDATE` operations.
- At Zhihu, when metadata was deleted, the corresponding data was also deleted. (We deleted outdated data in the Hadoop Distributed File System to save costs.) In addition, Hive users would sometimes improperly create tables and set a wrong partition path. This resulted in data being deleted by mistake.
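For reference, the periodic cleanup mentioned above was roughly of this shape; the table name and retention date are hypothetical, and a real job would loop over many tables:

```sql
-- Drop every partition older than the retention threshold. Hive accepts
-- comparison operators in the partition spec of DROP PARTITION. Because
-- dropping a partition of a managed table also deletes the underlying
-- HDFS data, a wrong partition path meant data deleted by mistake.
ALTER TABLE example_db.example_events
DROP IF EXISTS PARTITION (dt < '2019-01-01');
```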
Therefore, we began to look for another solution.
Solutions we compared
We compared multiple options and chose TiDB as our final solution.
MySQL sharding
We considered using MySQL sharding to balance the load of multiple MySQL databases in a cluster. However, we decided against this policy because it had these issues:
- To shard MySQL, we would need to modify the Metastore interface to operate on the sharded MySQL databases. This would involve many high-risk changes, and it would make future Hive upgrades more complicated.
- Every day, we replicated MySQL data to Hive for data governance and data life cycle management, using our internal data replication platform. If we had used MySQL sharding, we would have needed to update the platform’s replication logic.
Scaling Metastore using Federation
We thought we could scale Hive Metastore using Federation. We could form an architecture that consisted of MySQL and multiple sets of Hive Metastore and add a proxy in front of Metastore to distribute requests according to certain rules.
But after investigation, we found this policy also had flaws:
- To enable Federation on Hive Metastore, we wouldn’t need to modify Metastore itself, but we would have to maintain a set of routing components and carefully set the routing rules. Moreover, if we divided the existing MySQL store among different MySQL instances, the divisions might be uneven, resulting in unbalanced loads among subclusters.
- As with the MySQL sharding solution, we would need to update the replication logic of the data replication platform.
TiDB, with elastic scalability, is the perfect solution
TiDB is a distributed SQL database built by PingCAP and its open-source community. It is MySQL compatible and features horizontal scalability, strong consistency, and high availability. It’s a one-stop solution for both OLTP and OLAP workloads. You can learn more about TiDB’s architecture here.
As you may recall, our problem was that as the data size grew, MySQL, limited by its standalone architecture, could not deliver good performance, while forming individual MySQL databases into a cluster drastically increased complexity. If we could find a distributed, MySQL-compatible database, we could solve this problem. Therefore, TiDB was a perfect match.
We chose TiDB because it had the following advantages:
TiDB is compatible with the MySQL protocol. Our tests showed that TiDB supported all of the inserts, deletes, updates, and selects in Metastore, so using TiDB would not bring any compatibility-related issues. Therefore, all we needed to do was dump the MySQL data into TiDB.
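Because TiDB speaks the MySQL wire protocol, Metastore can connect to it through the standard MySQL JDBC driver with no code changes. One quick, hedged way to see the compatibility is the version string TiDB reports (the exact value depends on the TiDB release):

```sql
-- TiDB reports a MySQL-compatible version string so that existing
-- clients, drivers, and ORMs keep working unmodified.
SELECT VERSION();
-- Example output (varies by release): 5.7.25-TiDB-v4.0.0
```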
Due to its distributed architecture, TiDB far outperforms MySQL on large data sets and large numbers of concurrent queries.
TiDB has excellent, elastic horizontal scalability. Whether we chose MySQL sharding or Hive Metastore Federation, we could hit a bottleneck again and would then have to shard or federate once more. TiDB removes this problem.
TiDB is widely used at Zhihu, and the related technologies are relatively mature, so we can control the migration risk.
The Hive architecture
Before migration to TiDB
Before we migrated from MySQL to TiDB, our Hive architecture was as follows. In this architecture, Zue is a visual query interface for Zhihu’s internal use.
After migration to TiDB
After we migrated from MySQL to TiDB, the Hive architecture looks like this:
You can see that after we migrated metadata to TiDB, the architecture has almost no change. The query requests, which were on a single MySQL node, are now distributed in the TiDB cluster. The larger the TiDB cluster, the higher the query efficiency and the greater the performance improvement.
The migration process
We migrated from MySQL to TiDB this way:
- We used MySQL as the primary database and TiDB as the secondary database, replicating data from MySQL to TiDB in real time.
- We reduced the number of Metastore nodes to one, to prevent multiple Metastore nodes from writing to MySQL and TiDB simultaneously, which would cause inconsistent metadata.
- During the application’s off-peak hours, we switched from the primary database to the secondary: we made TiDB the primary and restarted Metastore (see the sanity-check sketch after this list).
- We added the Metastore nodes back.
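Before the switchover, a simple consistency check on both databases gives some confidence that replication has caught up. This is only an illustrative sketch: a real comparison would use a dedicated checksum tool, and the schema name `metastore` is our assumption about how the Metastore database is named:

```sql
-- Run against both MySQL and TiDB and compare the results. TBLS,
-- PARTITIONS, and SDS are among the largest tables in the Metastore schema.
SELECT COUNT(*) FROM metastore.TBLS;
SELECT COUNT(*) FROM metastore.PARTITIONS;
SELECT COUNT(*) FROM metastore.SDS;
```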
During the migration process, the application was not affected. Now TiDB successfully runs in our production environment.
The application’s running status
Operation execution time at the application peak
We tested the database at the Hive level: we simulated the application peak by concurrently deleting and adding partitions for tables with millions of partitions, executing Hive SQL statements like the following:
```sql
ALTER TABLE `${table_name}` DROP IF EXISTS PARTITION (...);
ALTER TABLE `${table_name}` ADD IF NOT EXISTS PARTITION (...);
```
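For readers unfamiliar with Hive partition specs, a concrete instance of the statements above could look like this (the table and its daily/hourly partition keys are hypothetical):

```sql
-- Drop and re-add a single partition of an hourly-partitioned table.
ALTER TABLE example_db.example_events
DROP IF EXISTS PARTITION (dt = '2020-07-01', hr = '08');
ALTER TABLE example_db.example_events
ADD IF NOT EXISTS PARTITION (dt = '2020-07-01', hr = '08');
```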
The operation execution time dropped from 45–75 seconds before migration to under 10 seconds after migration.
The impact of large queries on the database
At the Metastore level, we tested some of the SQL statements that Metastore submits, especially those that put great pressure on it, for example:
```sql
SELECT `A0`.`PART_NAME`, `A0`.`PART_NAME` AS `NUCORDER0`
FROM `PARTITIONS` `A0`
LEFT OUTER JOIN `TBLS` `B0` ON `A0`.`TBL_ID` = `B0`.`TBL_ID`
LEFT OUTER JOIN `DBS` `C0` ON `B0`.`DB_ID` = `C0`.`DB_ID`
WHERE `C0`.`NAME` = '${database_name}' AND `B0`.`TBL_NAME` = '${table_name}'
ORDER BY `NUCORDER0`
```
When the number of partitions in a Hive table was very large, this SQL statement put great pressure on Metastore. Before the migration, the execution time of this type of statement in MySQL was 30–40 seconds. After the migration to TiDB, it was 6–7 seconds. What a remarkable improvement!
Replication time
The storage descriptor (SDS) table, with more than 10 million rows, is one of the biggest tables in Metastore. The replication time of the SDS table on the data replication platform was reduced from 90 seconds to 15 seconds.
What’s next
In the Hive Metastore case, TiDB helps us horizontally scale our metadata storage database, so we no longer need to worry about its storage capacity. We hope that, in the future, TiDB can provide cross-data center (DC) services: through cross-DC deployment of data replicas, TiDB could connect online and offline scenarios, letting us run real-time extract, transform, load (ETL) tasks offline without putting pressure on online services. This would improve the real-time performance of offline ETL tasks. Toward that end, we’re developing TiBigData.
This project was initiated by Xiaoguang Sun, a TiKV Maintainer at Zhihu. Currently, it’s an incubation project in PingCAP Incubator. PingCAP Incubator aims to establish an incubation system for TiDB ecological open-source projects. For all projects, see pingcap-incubator on GitHub. You can check out PingCAP Incubator’s documentation here.
So far, the TiBigData project has provided read-only Presto and Flink support for TiDB. In the future, with the support of the PingCAP Incubator plan, we hope to build the TiBigData project together with the community and strive to enhance TiDB’s big data capabilities.
Originally published at www.pingcap.com on July 24, 2020
Translated from: https://medium.com/swlh/horizontally-scaling-the-hive-metastore-database-by-migrating-from-mysql-to-tidb-4636fed170ce