For our company's advertising business we need to report, across multiple dimensions, each app's device count, click-through rate, impression rate and related metrics, and the figures have to be deduplicated. My first thought was to use ClickHouse for these statistics: our platform's traffic is fairly heavy, and MySQL probably would not hold up.
First, I created four tables.
# Click data table
CREATE TABLE raw_click
(
    `Date` Date,
    `Time` DateTime,
    `Hour` Int8,
    `AdvertiserID` UInt32 DEFAULT 0,
    `AdsID` UInt32 DEFAULT 0,
    `DeveloperID` UInt32 DEFAULT 0,
    `WebID` UInt32 DEFAULT 0,
    `FeeTypeID` UInt32 DEFAULT 0,
    `AdvType` UInt8 DEFAULT 0,
    `GroupID` UInt32 DEFAULT 0,
    `PlatformID` UInt32 DEFAULT 0,
    `PlatformNameID` UInt8 DEFAULT 0,
    `MaterialId` UInt32 DEFAULT 0,
    `DeviceID` Nullable(String) DEFAULT NULL,
    `AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192

# Fill data table
CREATE TABLE raw_fill
(
    `Date` Date,
    `Time` DateTime,
    `Hour` Int8,
    `AdvertiserID` UInt32 DEFAULT 0,
    `AdsID` UInt32 DEFAULT 0,
    `DeveloperID` UInt32 DEFAULT 0,
    `WebID` UInt32 DEFAULT 0,
    `FeeTypeID` UInt32 DEFAULT 0,
    `AdvType` UInt8 DEFAULT 0,
    `GroupID` UInt32 DEFAULT 0,
    `PlatformID` UInt32 DEFAULT 0,
    `PlatformNameID` UInt8 DEFAULT 0,
    `MaterialId` UInt32 DEFAULT 0,
    `DeviceID` Nullable(String) DEFAULT NULL,
    `AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192

# Request data table
CREATE TABLE raw_request
(
    `Date` Date,
    `Time` DateTime,
    `Hour` Int8,
    `AdvertiserID` UInt32 DEFAULT 0,
    `AdsID` UInt32 DEFAULT 0,
    `DeveloperID` UInt32 DEFAULT 0,
    `WebID` UInt32 DEFAULT 0,
    `FeeTypeID` UInt32 DEFAULT 0,
    `AdvType` UInt8 DEFAULT 0,
    `GroupID` UInt32 DEFAULT 0,
    `PlatformID` UInt32 DEFAULT 0,
    `PlatformNameID` UInt8 DEFAULT 0,
    `MaterialId` UInt32 DEFAULT 0,
    `DeviceID` Nullable(String) DEFAULT NULL,
    `AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192

# Impression data table
CREATE TABLE raw_show
(
    `Date` Date,
    `Time` DateTime,
    `Hour` Int8,
    `AdvertiserID` UInt32 DEFAULT 0,
    `AdsID` UInt32 DEFAULT 0,
    `DeveloperID` UInt32 DEFAULT 0,
    `WebID` UInt32 DEFAULT 0,
    `FeeTypeID` UInt32 DEFAULT 0,
    `AdvType` UInt8 DEFAULT 0,
    `GroupID` UInt32 DEFAULT 0,
    `PlatformID` UInt32 DEFAULT 0,
    `PlatformNameID` UInt8 DEFAULT 0,
    `MaterialId` UInt32 DEFAULT 0,
    `DeviceID` Nullable(String) DEFAULT NULL,
    `AppOs` UInt8 DEFAULT 1
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY Date
SETTINGS index_granularity = 8192
When creating the tables I hesitated over two things. The first was whether the tables should be split by month, so I asked ChatGPT.
Its answer, roughly translated, amounted to "the limit of ClickHouse is the limit of your hardware", so I stopped worrying and just kept pushing data in.
The second thing I hesitated over was whether to create just one table and distinguish clicks, impressions and fills with a type column. After thinking it through, I still felt a separate table per event type was the best choice (a sketch of the single-table alternative is below for comparison).
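For reference, this is a rough sketch of the single-table design I decided against. The table name raw_event and the EventType column are hypothetical; everything else just mirrors the shared schema above.
-- Hypothetical single-table alternative: one schema for all events, distinguished by EventType
CREATE TABLE raw_event
(
    `Date` Date,
    `Time` DateTime,
    `EventType` Enum8('request' = 1, 'fill' = 2, 'show' = 3, 'click' = 4),
    `AdvertiserID` UInt32 DEFAULT 0,
    `AdsID` UInt32 DEFAULT 0,
    `DeviceID` Nullable(String) DEFAULT NULL
    -- ...the remaining dimension columns, same as the per-event tables above
)
ENGINE = MergeTree
PARTITION BY toYYYYMMDD(Date)
ORDER BY (Date, EventType);
With this layout every query on a single event type has to filter on EventType, whereas with separate tables each table stays smaller and the per-event queries stay simpler, which is why I went with one table per event.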
Every column in these tables is a dimension our business reports will need to filter on, so that is how the schema was settled.
Next I had to work out how to get the data inserted; here I will just share my insert script.
#!/usr/local/php/bin/php -q
<?php
declare(ticks=1);

const _TOUCHER_NAME_ = "ch_stat"; # name of this sync worker

// Load the ClickHouse client, and the dev environment config if present
include("int/clickhouse1.3.10/Clickhouse.php");
include("int/config.php");

$mq_name = $argv[1] ?? '';
if (empty($mq_name)) {
    exit("Not the right way to run this!");
}

$table_name_arr = [
    'raw_show_mq'    => 'raw_show',
    'raw_click_mq'   => 'raw_click',
    'raw_fill_mq'    => 'raw_fill',
    'raw_request_mq' => 'raw_request',
];
$table_name = $table_name_arr[$mq_name] ?? '';
if (empty($table_name)) {
    exit("Not the right way to run this either!");
}

# Listen for shutdown signals
$handle = true;
pcntl_signal(SIGTERM, 'handleSignal');
pcntl_signal(SIGINT, 'handleSignal');
pcntl_signal(SIGQUIT, 'handleSignal');

# Connect to Redis
$redisconn = redis_conn();
$redisconn->select(9);

$clickhouse = new Clickhouse($ch_config, 'database_name'); // placeholder: target database name

while (true) {
    # Exit once a day at 05:00:00 so a fresh process takes over
    if (date("H") == '05' && date("i") == '00' && date("s") == '00') {
        exit(_TOUCHER_NAME_ . ": I am gone away");
    }

    $start_time = microtime_float(); // record the start time

    try {
        $queueLen = $redisconn->lLen($mq_name);
    } catch (\Exception $e) {
        # Guard against Redis going down
        exit(_TOUCHER_NAME_ . ": redis gone away");
    }

    # Insert at most 1000 rows per batch for now
    $queue_count = 1000;
    $data = [];
    if ($queueLen < $queue_count) {
        # Not a full batch yet, just take what is there
        $queue_count = $queueLen;
        // msg2log(_TOUCHER_NAME_ . ": not enough data, waiting a bit!");
        // sleep(3);
        // continue;
    }

    for ($i = 0; $i < $queue_count; $i++) {
        # Pop one message off the queue
        $json_data = $redisconn->rPop($mq_name);
        if (empty($json_data)) {
            # Can it really be empty?
            continue;
        }
        # Collect the decoded row for the batch insert
        $data[] = json_decode($json_data, true);
    }

    if (empty($data)) {
        msg2log(_TOUCHER_NAME_ . ": nothing to consume in the queue right now!");
        sleep(5);
        continue;
    }

    # Batch insert
    try {
        $clickhouse->insert($table_name, $data);
    } catch (Exception $exception) {
        # Batch insert failed: push everything back onto the queue
        msg2log(_TOUCHER_NAME_ . ": batch insert failed, pushing the data back");
        foreach ($data as $v) {
            # If the data structure itself is broken, comment this out temporarily
            $redisconn->lPush($mq_name, json_encode($v));
        }
        # Clear the batch
        $data = [];
        # Check whether ClickHouse itself went down
        if (!$clickhouse->alive()) {
            exit("ClickHouse connection error, exiting so we can reconnect!");
        }
    }

    $end_time = microtime_float();
    if (!$handle) {
        msg2log(_TOUCHER_NAME_ . ": exiting on signal! Using Time " . ($end_time - $start_time) . " Sec, Total touched: " . count($data));
        break;
    }
    msg2log(_TOUCHER_NAME_ . ": Using Time " . ($end_time - $start_time) . " Sec, Total touched: " . count($data));
    sleep(3);
}

function handleSignal($signal)
{
    global $handle;
    switch ($signal) {
        case SIGTERM:
        case SIGINT:
        case SIGQUIT:
            $handle = false;
            # exit;
        // handle other signals here...
    }
}
?>
The script essentially pops data off the queue and inserts it into ClickHouse. On top of that it checks whether Redis or ClickHouse has gone away, pushes the batch back onto the queue when an insert fails so the rows are not lost, and, when we stop the script, it catches the signal and finishes off the data it is holding before exiting. All of this keeps data loss to a minimum.
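For context, each item the producers push onto the Redis list is assumed here to be a single JSON object whose keys match the target table's columns, e.g. {"Date":"2024-03-07","Time":"2024-03-07 10:15:00","Hour":10,"DeviceID":"abcdef123456","AppOs":1, ...} — an assumption on my part, since the decoded associative arrays are handed straight to the ClickHouse client's insert() call. The consumer is then started with the queue name as its only argument, one process per queue, for example php ch_stat.php raw_click_mq (the file name is hypothetical; use whatever the script is saved as).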
Once the insert script was working and data started flowing in, I found the data really does grow fast. This is after a little over two months of running; with the platform's traffic that is a lot of rows. Queries still returned correct results, but I noticed each SQL query took roughly four to five seconds (take the following query as an example):
SELECT Date, SUM(dau) AS dau, SUM(request) AS request, SUM(fill) AS fill, SUM(show) AS show, SUM(click) AS click
FROM
(
    SELECT Date, count(DISTINCT DeviceID) AS dau, count(*) AS request, 0 AS fill, 0 AS show, 0 AS click
    FROM raw_request
    WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
    GROUP BY Date
    UNION ALL
    SELECT Date, 0 AS dau, 0 AS request, count(*) AS fill, 0 AS show, 0 AS click
    FROM raw_fill
    WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
    GROUP BY Date
    UNION ALL
    SELECT Date, 0 AS dau, 0 AS request, 0 AS fill, count(*) AS show, 0 AS click
    FROM raw_show
    WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
    GROUP BY Date
    UNION ALL
    SELECT Date, 0 AS dau, 0 AS request, 0 AS fill, 0 AS show, count(*) AS click
    FROM raw_click
    WHERE PlatformNameID > 0 AND Date BETWEEN '2024-03-07' AND '2024-03-13'
    GROUP BY Date
) AS subquery
GROUP BY Date
ORDER BY Date DESC;
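To see where the time actually goes, one option is ClickHouse's own query log. Assuming query logging is enabled (it is by default), something like the following lists the duration and rows read of recent queries:
SELECT
    event_time,
    query_duration_ms,
    read_rows,
    substring(query, 1, 80) AS query_head
FROM system.query_log
WHERE type = 'QueryFinish'
  AND event_date = today()
ORDER BY event_time DESC
LIMIT 10;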
Later I realized that the older historical data is basically never used, and keeping it all around (plus backups) made me worry about running out of disk, so I decided to keep only the most recent month of data. Since my storage is partitioned by day, deletion also has to happen by day, and be sure to keep an archived copy of whatever you drop. The statement used for deletion is mainly
ALTER TABLE table DROP PARTITION date
The partition value has the form 20240303; a concrete example of archiving and then dropping one day's partition is shown below. After clearing out the old partitions, queries did indeed get a bit faster. Further optimization can come gradually later.
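As a sketch for one table and one day (raw_click and the date are only examples): FREEZE PARTITION takes a local snapshot of the partition, which is one way to keep that archived copy — by default it lands under the shadow/ folder of the ClickHouse data directory — and only then is the partition dropped.
-- snapshot the partition first (one way to keep an archived copy)
ALTER TABLE raw_click FREEZE PARTITION 20240303;
-- then drop it from the table
ALTER TABLE raw_click DROP PARTITION 20240303;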
One more routine operation is adding columns. The whole point of ClickHouse here is to store as much information as possible, ideally covering every query condition we might need; if a dimension was forgotten, we add it:
ALTER TABLE table ADD COLUMN column_name UInt8 DEFAULT default_value;
Adding a column is quite fast: running this statement against over a hundred million rows took less than a second, which I suspect is related to the columnar storage.
Finally, this is just a bit of personal experience sharing. I welcome discussion and exchange, and I hope it is of some help to you.