clickhouse MPPDB数据库--新特性使用示例

clickhouse 新特性:

从clickhouse 22.3至最新的版本24.3.2.23,clickhouse在快速发展中,每个版本都增加了一些新的特性,在数据写入、查询方面都有性能加速。
本文根据clickhouse blog中的clickhouse release blog中,学习并梳理了一些在实际工作中可能用到的新特性。

以下是如何基于docker,如果试用这些新性

docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v  chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23docker exec -it bashclickhouse-client --format_csv_delimiter=','

transform函数

进行字典替换

transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)

UK-house-price-dataset.csv

CREATE TABLE uk_price_paid
(price UInt32,date Date,postcode1 LowCardinality(String),postcode2 LowCardinality(String),type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),is_new UInt8,duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),addr1 String,addr2 String,street LowCardinality(String),locality LowCardinality(String),town LowCardinality(String),district LowCardinality(String),county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);INSERT INTO uk_price_paid
WITHsplitByChar(' ', postcode) AS p
SELECTtoUInt32(price_string) AS price,parseDateTimeBestEffortUS(time) AS date,p[1] AS postcode1,p[2] AS postcode2,transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,b = 'Y' AS is_new,transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10

读取文件

可以自动识别文件的类型,推荐字段类型

SELECT * FROM (
WITHsplitByChar(' ', postcode) AS p
SELECTtoUInt32(price_string) AS price,parseDateTimeBestEffortUS(time) AS date,p[1] AS postcode1,p[2] AS postcode2,transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,b = 'Y' AS is_new,transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration, addr1, addr2, street, locality, town, district, county
FROM file('UK-house-price-dataset.csv','CSV','uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
) SETTINGS format_csv_delimiter=','
) LIMIT 2;

自定义函数

根据需要,编写自定义函数

CREATE OR REPLACE TABLE line_changes
(version UInt32,line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),line_number UInt32,line_content String,time datetime default now()
)
ENGINE = MergeTree
ORDER BY time;INSERT INTO default.line_changes (version,line_change_type,line_number,line_content) VALUES
(1, 'Add'   , 1, 'ClickHouse provides SQL'),
(2, 'Add'   , 2, 'with improvements'),
(3, 'Add'   , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add'   , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add'   , 1, 'ClickHouse provides a superset of SQL');-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) -> arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));

arrayFold

SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30],  0::UInt64) AS sum;CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (SELECT arrayZip(groupArray(line_change_type),groupArray(line_number),groupArray(line_content)) as line_opsFROM (SELECT * FROM line_changes WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(arrayFold((acc, v) -> if(v.'change_type' = 'Add',       add(acc, v.'line_nr', v.'content'),if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),[]::Array(String))) as lines
FROM T1;SELECT * FROM text_version(version = 3);

Parallel window functions

窗口函数采用并行计算,性能大幅提升

SELECTcountry,day,max(tempAvg) AS temperature,avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BYcountry,date AS day
ORDER BYcountry ASC,day ASC

FINAL

基于FINAL及enable_vertical_final,在如下引擎
ReplacingMergeTree、 AggregatingMergeTree引擎中,可以快速查询到最新的数据

SELECTpostcode1,formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;SELECTpostcode1,formatReadableQuantity(avg(price))
FROM uk_property_offers
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;

Variant Type

SET allow_experimental_variant_type=1, use_variant_as_common_type = 1;SELECTmap('Hello', 1, 'World', 'Mark') AS x,toTypeName(x) AS type
FORMAT Vertical;SELECTarrayJoin([1, true, 3.4, 'Mark']) AS value,toTypeName(value)
Row 1:
──────
x:    {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true  │ Variant(Bool, Float64, String, UInt8) │
2. │ true  │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4   │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark  │ Variant(Bool, Float64, String, UInt8) │└───────┴───────────────────────────────────────┘

字符相似性函数

  • byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols are different. In other words, it measures the minimum number of substitutions required to change one string into the other, or equivalently, the minimum number of errors that could have transformed one string into the other. In a more general context, the Hamming distance is one of several string metrics for measuring the edit distance between two sequences. It is named after the American mathematician Richard Hamming.

    • karolin” and “kathrin” is 3.
    • karolin” and “kerstin” is 3.
    • kathrin” and “kerstin” is 4.
    • 0000 and 1111 is 4.
    • 2173896 and 2233796 is 3.
  • editDistance:a way of quantifying how dissimilar two strings (e.g., words) are to one another, that is measured by counting the minimum number of operations required to transform one string into the other.

  • damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

  • jaroWinklerSimilarity: a string metric measuring an edit distance between two sequences. It is a variant of the Jaro distance metric

  • levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (consisting of insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance

CREATE TABLE domains
(`domain` String,`rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;INSERT INTO domains SELECTc2 AS domain,1 / c1 AS rank
FROM url('domains.csv', 'CSV');SELECTdomain,levenshteinDistance(domain, 'facebook.com') AS d1,damerauLevenshteinDistance(domain, 'facebook.com') AS d2,jaroSimilarity(domain, 'facebook.com') AS d3,jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10 
Query id: 6f499f27-8274-4787-819a-b510322bdce3┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐1. │ facebook.com  │  0 │  0 │                  1 │                  1 │2. │ facebonk.com  │  1 │  1 │ 0.8838383838383838 │ 0.9303030303030303 │3. │ fabebook.com  │  1 │  1 │  0.914141414141414 │ 0.9313131313131312 │4. │ facabook.com  │  1 │  1 │ 0.9444444444444443 │  0.961111111111111 │5. │ facobook.com  │  1 │  1 │ 0.8535353535353535 │ 0.8974747474747474 │6. │ facebook1.com │  1 │  1 │ 0.9743589743589745 │ 0.9846153846153847 │7. │ faceook.com   │  1 │  1 │ 0.9722222222222221 │ 0.9833333333333333 │8. │ faacebook.com │  1 │  1 │ 0.9743589743589745 │ 0.9794871794871796 │9. │ faceboock.com │  1 │  1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com  │  1 │  1 │ 0.9444444444444443 │ 0.9666666666666666 │└───────────────┴────┴────┴────────────────────┴────────────────────┘

Vectorized distance functions

可以作为向量数据库使用,支持L2,cosineDistance,IP三种向量相似度的度量方法

https://clickhouse.com/blog/clickhouse-release-24-02

WITH 'dog' AS search_term,
(SELECT vectorFROM gloveWHERE word = search_termLIMIT 1
) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;WITH'dog' AS search_term,(SELECT vectorFROM gloveWHERE word = search_termLIMIT 1) AS target_vector
SELECTword,1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

Adaptive asynchronous inserts

Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is inserted into a buffer first and then written to the database storage later or asynchronously respectively.
在这里插入图片描述

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/793289.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

【Java设计模式】序:设计模式总体概述

目录 什么是设计模式设计模式的分类1 创建型模式1.1. 单例&#xff08;Singleton&#xff09;1.2 原型&#xff08;Prototype&#xff09;1.3 工厂方法&#xff08;FactoryMethod&#xff09;1.4 抽象工厂&#xff08;AbstractFactory&#xff09;1.5 建造者&#xff08;Builde…

31. 下一个排列 —— LeetCode (python) [PS: LeetCode 运行环境疑似出错]

# encoding utf-8 # 开发者&#xff1a;xxx # 开发时间&#xff1a; 20:26 # "Stay hungry&#xff0c;stay foolish."class Solution(object):def nextPermutation(self, nums):import itertoolsl len(nums)a tuple(nums)nums.sort()permutations_lst list(ite…

2024年清明节安装matlab 2024a

下载安装离线支持包SupportSoftwareDownloader_R2024a_win64&#xff0c;地址https://ww2.mathworks.cn/support/install/support-software-downloader.html&#xff0c;运行软件&#xff08;自解压运行&#xff09;&#xff0c;登录账号&#xff08;需要提前在官网注册&#x…

反转链表 - LeetCode 热题 23

大家好&#xff01;我是曾续缘&#x1f497; 今天是《LeetCode 热题 100》系列 发车第 23 天 链表第 2 题 ❤️点赞 &#x1f44d; 收藏 ⭐再看&#xff0c;养成习惯 反转链表 给你单链表的头节点 head &#xff0c;请你反转链表&#xff0c;并返回反转后的链表。 示例 1&#…

时序预测 | Matlab基于CFBP级联前向BP神经网络时序预测

时序预测 | Matlab基于CFBP级联前向BP神经网络时序预测 目录 时序预测 | Matlab基于CFBP级联前向BP神经网络时序预测预测效果基本介绍程序设计参考资料 预测效果 基本介绍 1.Matlab基于CFBP级联前向BP神经网络时序预测&#xff08;完整源码和数据)&#xff1b; 2.数据集为excel…

开源模型应用落地-chatglm3-6b模型小试-入门篇(三)

一、前言 刚开始接触AI时&#xff0c;您可能会感到困惑&#xff0c;因为面对众多开源模型的选择&#xff0c;不知道应该选择哪个模型&#xff0c;也不知道如何调用最基本的模型。但是不用担心&#xff0c;我将陪伴您一起逐步入门&#xff0c;解决这些问题。 在信息时代&#xf…

路由Vue-Router使用

Vue Router 是 Vue.js 的官方路由。它与 Vue.js 核心深度集成&#xff0c;让用 Vue.js 构建单页应用变得轻而易举。 介绍 | Vue Router (vuejs.org) 1. 安装 npm install vue-router4 查看安装好的vue-router 2. 添加路由 新建views文件夹用来存放所有的页面&#xff0c;在…

笔记: javaSE day17天笔记

第十七天课堂笔记 Java常用类 数学类★★★ math java.lang.Math , 数学类 round(x) : 四舍五入 , 把 x加0.5 后向下取整 ceil(x) : 返回大于等于x的最小整数 , 向上取整 floor(x) : 返回小于等于x的最大整数 , 向下取整 sqrt(x) : 平方根 cbrt(x): 立方根 pow(a , b)…

LangChain Demo | Agent X ReAct X wikipedia 询问《三体》的主要内容

背景 LangChain学习中&#xff0c;尝试改了一下哈里森和吴恩达课程当中的问题&#xff0c;看看gpt-3.5-turbo在集成了ReAct和wikipedia后&#xff0c;如何回答《三体》的主要内容是什么这个问题&#xff0c;当然&#xff0c;主要是为了回答这问题时LangChain内部发生了什么。所…

基于大型语言模型的智能体(Agent)研究综述--人大

内容概述 论文地址&#xff1a;https://arxiv.org/pdf/2308.11432.pdf 这篇综述内容有35页&#xff0c;内容很多&#xff0c;俗话说一图胜千言&#xff0c;作者提供了5张精美的图片和3个表格&#xff0c;把这些搞明白后对这篇综述也就理解差不多了。文章的总体结构如下由6部分…

基于GaN的半导体光学放大器SOA

摘要 基于GaN的材料可覆盖很宽的光谱范围&#xff0c;以紫外、紫、蓝、绿和红波发射的激光二极管已经商业化。基于GaN的半导体光学放大器&#xff08;SOA&#xff09;具有提高激光二极管输出功率的能力&#xff0c;因此SOA将有很多潜在应用。未来需要利用短波、超快脉冲特性的…

常见的四种限流算法及基础实现

常见的四种限流算法及基础实现 什么是限流有哪些限流算法&#xff1f;限流算法固定窗口滑动窗口漏桶算法令牌算法 什么是限流 限流是对某一时间窗口内的请求数进行限制&#xff0c;保持系统的可用性和稳定性&#xff0c;防止因流量暴增而导致的系统运行缓慢或宕机。 在高并发…

用kimichat批量识别出图片版PDF文件中的文字内容

图片版的PDF文件&#xff0c;怎么才能借助AI工具来提取其中全部的文字内容呢&#xff1f; 第一步&#xff1a;将PDF文件转换成图片格式 具体方法参见文章&#xff1a;《零代码编程&#xff1a;用kimichat将图片版PDF自动批量分割成多个图片》 第二步&#xff1a;识别图片中的…

Dynamo之雪花分形(衍生式设计)

你好&#xff0c;这里是BIM的乐趣&#xff0c;我是九哥~ 今天简单分享一些我收集的Dynamo的雪花分形案例吧&#xff0c;不过多讲解了&#xff0c;有兴趣的小伙伴&#xff0c;可以私信“雪花分形”获取案例文件&#xff0c;下面基本以分享为主&#xff1a; ******多图预警****…

GD32F470_GY-SHT31-D 数字温湿度传感器模块移植

2.11 SHT30温湿度传感器 2.11.1 模块来源 采购链接&#xff1a; GY-SHT31-D 数字温湿度传感器模块 资料下载链接&#xff1a; https://pan.baidu.com/s/1kisMJspcV6Qdr1ye9ElOlQ 2.11.2 规格参数 工作电压&#xff1a;2.4-5.5V 工作电流&#xff1a;0.2~1500uA 温度测量范围&a…

构建未来交通:香橙派OPI Airpro上的智能交通监管系统

引言&#xff1a; 随着城市化进程的加速&#xff0c;交通管理变得越来越复杂。 传统的交通监管系统往往无法有效应对日益增长的车辆数量和复杂的交通状况。因此&#xff0c;我们需要一种更加智能和自适应的解决方案来提高交通效率并减少事故发生率。 香橙派OPI Airpro以其强大的…

ComfyUI ClipSeg插件报错- resize_image出错应该怎么办

上一篇刚介绍了这个插件&#xff0c;结果emm..很快发现事情并不简单...结果又报错了。 后台报错信息&#xff1a; Unused or unrecognized kwargs: padding. !!! Exception during processing !!! Traceback (most recent call last): File "F:\ComfyUI-aki\execution.p…

Open-Sora环境搭建推理测试

引子 Sora&#xff0c;2024年2月15日&#xff0c;OpenAI发布的人工智能文生视频大模型。支持60秒视频生成&#xff0c;震荡了国内国际学术圈、广告圈、AI教培圈。Sora最主要有三个优点&#xff1a;第一&#xff0c;“60s超长视频”&#xff0c;之前文本生成视频大模型一直无法真…

数据库重点知识(个人整理笔记)

目录 1. 索引是什么&#xff1f; 1.1. 索引的基本原理 2. 索引有哪些优缺点&#xff1f; 3. MySQL有哪几种索引类型&#xff1f; 4. mysql聚簇和非聚簇索引的区别 5. 非聚簇索引一定会回表查询吗&#xff1f; 6. 讲一讲前缀索引&#xff1f; 7. 为什么索引结构默认使用B…

【Visual Studio】将项目下的文件夹所有文件随编译自动复制输出到运行目录

要将项目根目录下的文件夹内容输出到运行目录&#xff0c;去处理其中的子文件夹和文件&#xff0c;逐个手动设置文件属性或进行复制显然不是一个可行的方法&#xff0c;因为这既繁琐又低效&#xff0c;那有没有更加高效的方式呢 文章目录 选择文件夹修改配置文件输出文件夹 这里…