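What Performance Monitoring Does SAP BTP Hyperscaler PostgreSQL Offer? (Part 1)

Preface

On the SAP BTP cloud platform, HANA is the in-house database of first choice, but the platform also supports a complete PostgreSQL service and offers it to customers as PaaS. You order PG instances one at a time, choosing from a set of standard sizes, and then build your own business logic on top of them. "Hyperscaler" is the code name of this offering; it roughly means very large scale, elastic, and scalable in both directions.

SAP BTP does not deeply customize PostgreSQL itself. It only adds a further layer of abstraction and integration on top of the stock PostgreSQL RDS offerings from AWS, Azure, GCP, and even AC (Ali Cloud), so that PG administration is more automated and easier to use, and so that a good SLA can be achieved.

So which useful performance monitoring metrics does it provide?

Analysis

1. Hyperscaler PostgreSQL service plans

Monitoring on the SAP BTP cloud platform is built on a full integration with Dynatrace, and this largely applies to the database as well.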

cf marketplace -s postgresql-db
Getting service plan information for service postgresql-db as my.user@*.com...
OK

service plan   description                                            free or paid
development    PostgreSQL service offering for development purposes   paid
standard       Standard PostgreSQL service offering                   paid
premium        Premium PostgreSQL service offering                    paid

As the output shows, the postgresql-db service typically offers the three plans below, in ascending order of size.

Service Plan (Marketplace)	Associated Entitlements	Compute Resources	Storage Disk
development	development	`1 vCPU + 2GB RAM` (default, non-configurable)	`20GB` (non-configurable)
standard	standard, storage and/or storage_ha	`1 vCPU + 2GB RAM` (default), `2 vCPU + 4GB RAM`	Configurable in blocks of 5GB
premium	premium, storage and/or storage_ha	`2 vCPU + 8GB RAM` (default), `4 vCPU + 16GB RAM`, `8 vCPU + 32GB RAM`, `16 vCPU + 64GB RAM`	Configurable in blocks of 5GB

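The development plan is fixed: 20 GB of storage, 1 vCPU, and 2 GB of RAM. It is suited to development environments. The standard plan has only two compute tiers, with 2 GB or 4 GB of RAM; pick it when your actual needs fit. The premium plan can be scaled up step by step, with CPU and RAM growing proportionally.

You can upgrade from a lower plan to a higher one with the following command; the reverse direction is not possible.

cf update-service <instance_name> -p <new_plan>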
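The upgrade rule above (lower plan to higher plan only) can be sketched as a small check. This is a minimal illustration, assuming the plan order development < standard < premium from the table; the helper name `can_upgrade` is hypothetical and not part of the cf CLI or any SAP API.

```python
# Plan names from the marketplace listing, ordered from smallest to largest.
PLAN_ORDER = ["development", "standard", "premium"]


def can_upgrade(current_plan: str, new_plan: str) -> bool:
    """Return True only if new_plan is strictly higher than current_plan."""
    return PLAN_ORDER.index(new_plan) > PLAN_ORDER.index(current_plan)


if __name__ == "__main__":
    print(can_upgrade("development", "standard"))  # upgrade: allowed
    print(can_upgrade("premium", "standard"))      # downgrade: rejected
```

A check like this could run before calling `cf update-service`, so that a downgrade attempt is caught locally rather than by the platform.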

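When creating an instance, you can also supply some custom parameters for the DB instance itself. These parameters are fairly limited as well, so I will not go into detail here.

For example:
engine_version, multi_az (for high availability and disaster recovery across availability zones), locale, backup_retention_period (the allowed values differ between IaaS providers)

db_parameters: [], an array that sets instance-level configuration parameters. Only a limited set is supported. Common ones include:

max_wal_size, checkpoint_timeout, autovacuum_vacuum_scale_factor, autovacuum_max_workers, max_locks_per_transaction, idle_in_transaction_session_timeout.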

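2. Monitoring metrics

The number of available metrics keeps growing, mainly because Dynatrace's monitoring capabilities keep getting stronger.

They fall into three groups: system metrics (provided and maintained by the RDS provider), process metrics (metrics from the running PG instance), and Multi-AZ metrics.

For reasons of space, this article covers only the AWS-related monitoring metrics of Hyperscaler.

System metrics:

Hyperscaler monitoring metrics sourced from AWS: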

1) Burst Balance

The percentage of General Purpose SSD (gp2) burst-bucket I/O credits available to the instance. This metric is only valid for burstable instance types.

Timeseries ID: custom:postgres-db.burst_balance
Metric Key: ext:postgres-db.burst_balance
Unit: Percent
Dimensions: N/A
Source: CloudWatch (AWS/RDS): BurstBalance
Aggregation Type: AVG

2) CPU Credit Balance

The number of CPU credits the instance has accrued since it was launched or started. This metric is only valid for burstable instance types.

Timeseries ID: custom:postgres-db.cpu_credit_balance
Metric Key: ext:postgres-db.cpu_credit_balance
Unit: Count (count)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): CPUCreditBalance
Aggregation Type: AVG

3) CPU Credit Usage

The number of CPU credits the instance has spent on CPU utilization. This metric is only valid for burstable instance types.

Timeseries ID: custom:postgres-db.cpu_credit_usage
Metric Key: ext:postgres-db.cpu_credit_usage
Unit: Count (count)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): CPUCreditUsage
Aggregation Type: AVG

4) CPU Utilization

The percentage of CPU utilization on the instance.

Timeseries ID: custom:postgres-db.cpu_utilization
Metric Key: ext:postgres-db.cpu_utilization
Unit: Percent
Dimensions: N/A
Source: CloudWatch (AWS/RDS): CPUUtilization
Aggregation Type: AVG

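5) Database Connections

The number of database connections in use by the instance. This does not include disconnected connections that the database has not yet cleaned up.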

Timeseries ID: custom:postgres-db.db_connections
Metric Key: ext:postgres-db.db_connections
Unit: Count (count)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): DatabaseConnections
Aggregation Type: COUNT

6) Disk Queue Depth

The number of outstanding I/Os (read/write requests) waiting to access the disk.

Timeseries ID: custom:postgres-db.disk_queue_depth
Metric Key: ext:postgres-db.disk_queue_depth
Unit: Count (count)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): DiskQueueDepth
Aggregation Type: COUNT

7) Memory Utilization

The percentage of random-access memory (RAM) in use on the instance.

Timeseries ID: custom:postgres-db.memory_utilization
Metric Key: ext:postgres-db.memory_utilization
Unit: Percent
Dimensions: N/A
Source: CloudWatch (AWS/RDS): FreeableMemory
Aggregation Type: AVG

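Note: The raw metric from AWS CloudWatch reports freeable memory in bytes. The metric processor converts it to the percentage of memory used by the instance with the following formula: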

memory_utilization = ((instance_memory - freeable_memory)/instance_memory) x 100
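The formula above can be written as a small function. This is a minimal sketch; the function and parameter names are illustrative assumptions, and both inputs are taken to be in bytes.

```python
def memory_utilization(instance_memory_bytes: float,
                       freeable_memory_bytes: float) -> float:
    """Percentage of instance memory in use, following the formula above."""
    if instance_memory_bytes <= 0:
        raise ValueError("instance_memory_bytes must be positive")
    used = instance_memory_bytes - freeable_memory_bytes
    return used / instance_memory_bytes * 100


if __name__ == "__main__":
    two_gib = 2 * 1024**3
    # 512 MiB freeable out of 2 GiB of RAM means 75 % is in use.
    print(memory_utilization(two_gib, 512 * 1024**2))  # 75.0
```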

8) Disk Utilization

The percentage of storage space in use on the instance.

Timeseries ID: custom:postgres-db.disk_utilization
Metric Key: ext:postgres-db.disk_utilization
Unit: Percent
Dimensions: N/A
Source: CloudWatch (AWS/RDS): FreeStorageSpace
Aggregation Type: AVG

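Note: The raw metric from AWS CloudWatch reports free storage space in bytes. The metric processor converts it to the percentage of storage used by the instance with the following formula: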

disk_utilization = ((instance_disk - free_storage_space)/instance_disk) x 100

9) Free Storage Space

The amount of free storage space available on the instance.

Timeseries ID: custom:postgres-db.free_storage_space
Metric Key: ext:postgres-db.free_storage_space
Unit: MegaByte (MB)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): FreeStorageSpace
Aggregation Type: AVG

Note: The raw metric from AWS CloudWatch reports free storage space in bytes. The metric processor converts it to megabytes with the following formula:

free_storage = (free_storage_space / (1024 * 1024))
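As a quick worked example of this conversion, the sketch below divides a raw byte value by 1024 * 1024. The helper name `bytes_to_mb` is an illustrative assumption, not a function of the actual metric processor.

```python
BYTES_PER_MB = 1024 * 1024


def bytes_to_mb(value_in_bytes: float) -> float:
    """Convert a raw byte value, as CloudWatch reports it, to megabytes."""
    return value_in_bytes / BYTES_PER_MB


if __name__ == "__main__":
    # 5 GiB of free storage, reported in bytes.
    print(bytes_to_mb(5 * 1024**3))  # 5120.0
```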

10) Maximum Used Transaction IDs

The maximum number of transaction IDs that have been used so far.

Timeseries ID: custom:postgres-db.max_used_txn_ids
Metric Key: ext:postgres-db.max_used_txn_ids
Unit: Count (count)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): MaximumUsedTransactionIDs
Aggregation Type: AVG
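One way to read this metric: PostgreSQL transaction IDs are 32-bit, so a database has to be vacuumed before its oldest unfrozen transaction is roughly 2^31 (about 2.1 billion) transactions old. The sketch below expresses the metric as a share of that limit. The function name and the percentage view are illustrative assumptions, not part of the SAP or AWS tooling.

```python
# Transaction IDs are compared modulo 2**32, so about 2**31 transactions
# can separate the oldest unfrozen transaction from the current one.
XID_WRAPAROUND_LIMIT = 2**31


def xid_usage_percent(max_used_txn_ids: int) -> float:
    """Share of the wraparound limit consumed, in percent."""
    return max_used_txn_ids / XID_WRAPAROUND_LIMIT * 100


if __name__ == "__main__":
    # About 1.07 billion used transaction IDs is half of the limit.
    print(xid_usage_percent(2**30))  # 50.0
```

A value that keeps climbing toward 100 % suggests that autovacuum is not keeping up with freezing old rows.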

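11) Network Receive Throughput

The incoming (receive) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication.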

Timeseries ID: custom:postgres-db.network_receive_throughput
Metric Key: ext:postgres-db.network_receive_throughput
Unit: MegaBytePerSecond (MB/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): NetworkReceiveThroughput
Aggregation Type: AVG

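Note: The raw metric from AWS CloudWatch reports throughput in bytes per second. The metric processor converts it to megabytes per second with the following formula: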

network_receive_throughput = (network_receive_throughput_in_bytes/(1024 x 1024))

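12) Network Transmit Throughput

The outgoing (transmit) network traffic on the DB instance, including both customer database traffic and Amazon RDS traffic used for monitoring and replication.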

Timeseries ID: custom:postgres-db.network_transmit_throughput
Metric Key: ext:postgres-db.network_transmit_throughput
Unit: MegaBytePerSecond (MB/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): NetworkTransmitThroughput
Aggregation Type: AVG

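Note: The raw metric from AWS CloudWatch reports throughput in bytes per second. The metric processor converts it to megabytes per second with the following formula: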

network_transmit_throughput = (network_transmit_throughput_in_bytes/(1024 x 1024))

13) Oldest Replication Slot Lag

The lag, in terms of write-ahead log (WAL) data received, of the replica that is furthest behind.

Timeseries ID: custom:postgres-db.oldest_replication_slot_lag
Metric Key: ext:postgres-db.oldest_replication_slot_lag
Unit: MegaByte (MB)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): OldestReplicationSlotLag
Aggregation Type: AVG

14) Read IOPS

The average number of disk read I/O operations per second.

Timeseries ID: custom:postgres-db.read_iops
Metric Key: ext:postgres-db.read_iops
Unit: PerSecond (count/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): ReadIOPS
Aggregation Type: AVG

15) Read Latency

The average time taken per disk read I/O operation.

Timeseries ID: custom:postgres-db.read_latency
Metric Key: ext:postgres-db.read_latency
Unit: Second (s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): ReadLatency
Aggregation Type: AVG

16) Read Throughput

The average number of bytes read from disk per second.

Timeseries ID: custom:postgres-db.read_throughput
Metric Key: ext:postgres-db.read_throughput
Unit: MegaBytePerSecond (MB/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): ReadThroughput
Aggregation Type: AVG

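Note: The raw metric from AWS CloudWatch reports throughput in bytes per second. The metric processor converts it to megabytes per second with the following formula: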

read_throughput = (read_throughput_in_bytes/(1024 x 1024))

17) Replication Slot Disk Usage

The disk space used by replication slots.

Timeseries ID: custom:postgres-db.replication_slot_disk_usage
Metric Key: ext:postgres-db.replication_slot_disk_usage
Unit: MegaByte (MB)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): ReplicationSlotDiskUsage
Aggregation Type: AVG

18) Swap Usage

The amount of swap space used by the DB instance.

Timeseries ID: custom:postgres-db.swap_usage
Metric Key: ext:postgres-db.swap_usage
Unit: MegaByte (MB)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): SwapUsage
Aggregation Type: AVG

Note: The raw metric from AWS CloudWatch reports swap usage in bytes. The metric processor converts it to megabytes with the following formula:

swap_usage = (swap_usage_in_bytes/(1024 x 1024))

19) Transaction Logs Disk Usage

The disk space used by transaction logs.

Timeseries ID: custom:postgres-db.txn_log_disk_usage
Metric Key: ext:postgres-db.txn_log_disk_usage
Unit: MegaByte (MB)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): TransactionLogsDiskUsage
Aggregation Type: AVG

20) Transaction Logs Generation

The size of transaction logs generated per second.

Timeseries ID: custom:postgres-db.txn_log_generation
Metric Key: ext:postgres-db.txn_log_generation
Unit: MegaBytePerSecond (MB/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): TransactionLogsGeneration
Aggregation Type: AVG

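Note: The raw metric from AWS CloudWatch reports throughput in bytes per second. The metric processor converts it to megabytes per second with the following formula: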

txn_log_generation = (txn_log_generation_in_bytes/(1024 x 1024))

21) Write IOPS

The average number of disk write I/O operations per second.

Timeseries ID: custom:postgres-db.write_iops
Metric Key: ext:postgres-db.write_iops
Unit: PerSecond (count/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): WriteIOPS
Aggregation Type: AVG

22) Write Latency

The average time taken per disk write I/O operation.

Timeseries ID: custom:postgres-db.write_latency
Metric Key: ext:postgres-db.write_latency
Unit: Second (s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): WriteLatency
Aggregation Type: AVG

23) Write Throughput

The average number of bytes written to disk per second.

Timeseries ID: custom:postgres-db.write_throughput
Metric Key: ext:postgres-db.write_throughput
Unit: MegaBytePerSecond (MB/s)
Dimensions: N/A
Source: CloudWatch (AWS/RDS): WriteThroughput
Aggregation Type: AVG

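Note: The raw metric from AWS CloudWatch reports throughput in bytes per second. The metric processor converts it to megabytes per second with the following formula: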

write_throughput = (write_throughput_in_bytes/(1024 x 1024))
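Taken together, the conversion notes in this list describe a simple rule: some CloudWatch values are divided by 1024 x 1024, the rest pass through unchanged. The sketch below shows one way such a processing step could look. It is an assumption-laden illustration, not the actual SAP or Dynatrace metric processor: the function name and the dictionary input shape are hypothetical, and only metrics with an explicit conversion note above are treated as byte-based.

```python
from typing import Dict

BYTES_PER_MB = 1024 * 1024

# CloudWatch metric names whose notes above describe a bytes -> MB conversion.
BYTE_BASED_METRICS = {
    "FreeStorageSpace",
    "NetworkReceiveThroughput",
    "NetworkTransmitThroughput",
    "ReadThroughput",
    "SwapUsage",
    "TransactionLogsGeneration",
    "WriteThroughput",
}


def convert_datapoints(raw: Dict[str, float]) -> Dict[str, float]:
    """Scale byte-based values to megabytes; leave every other metric as is."""
    return {
        name: value / BYTES_PER_MB if name in BYTE_BASED_METRICS else value
        for name, value in raw.items()
    }


if __name__ == "__main__":
    sample = {"SwapUsage": 1048576, "CPUUtilization": 42.5}
    print(convert_datapoints(sample))  # {'SwapUsage': 1.0, 'CPUUtilization': 42.5}
```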

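Summary

All of the performance metrics listed above can be put on a custom Dynatrace dashboard directly. A simple example from Dynatrace:

[Figure: example Dynatrace dashboard showing Hyperscaler PG metrics on SAP BTP]

These are all results collected automatically by the Dynatrace monitoring system once it has been bound.

From the 23 metrics above, we can also learn how to pick out the most critical performance indicators of a system and thus maintain it better.

Dynatrace, a vendor focused purely on APM, has managed to go public, so it clearly has its own strengths. Looking at these metrics from a DBA's point of view, we can also ask ourselves: with the knowledge and skills we already have, could we map each of the metrics above to a corresponding SQL monitoring statement?

A traditional on-premises DBA can define the monitoring metrics and run the monitoring on their own. On a cloud platform, the vendor may take care of all of that. What is left for you is to watch the metrics, understand what they mean, analyze the problems that show up and resolve them properly, so that the system runs in better health.

To follow all of my articles, you can also follow my WeChat official account: 数据库杂记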
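As a partial answer to that question, the sketch below pairs a few of the metric keys with PostgreSQL queries that give a comparable value. This is an illustration only: the mapping and the helper `sql_for` are assumptions, the queries are not necessarily how RDS computes the CloudWatch values, and they assume PostgreSQL 10 or later plus sufficient privileges (for example for `pg_ls_waldir()`).

```python
from typing import Optional

# Hypothetical mapping: Dynatrace metric key -> comparable PostgreSQL query.
SQL_BY_METRIC = {
    "ext:postgres-db.db_connections":
        "SELECT count(*) FROM pg_stat_activity "
        "WHERE backend_type = 'client backend';",
    "ext:postgres-db.max_used_txn_ids":
        "SELECT max(age(datfrozenxid)) FROM pg_database;",
    "ext:postgres-db.oldest_replication_slot_lag":
        "SELECT max(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) "
        "FROM pg_replication_slots;",
    "ext:postgres-db.txn_log_disk_usage":
        "SELECT sum(size) FROM pg_ls_waldir();",
}


def sql_for(metric_key: str) -> Optional[str]:
    """Return the comparable SQL statement, or None if no mapping is known."""
    return SQL_BY_METRIC.get(metric_key)


if __name__ == "__main__":
    print(sql_for("ext:postgres-db.max_used_txn_ids"))
```

Host-level metrics such as CPU utilization, IOPS, or swap usage have no direct SQL equivalent, which is one reason the platform collects them from CloudWatch instead.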