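ClickHouse MPP Database: New Feature Usage Examples

New ClickHouse features:

From ClickHouse 22.3 through 24.3.2.23 (the latest version at the time of writing), ClickHouse has been evolving rapidly. Each release adds new features and speeds up data ingestion and queries.
This article collects, from the release posts on the ClickHouse blog, some new features that are likely to be useful in day-to-day work.

The following shows how to try these features out with Docker: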

docker run -d --name=ch -p 8123:8123 -p 9000:9000 -p 9009:9009 --ulimit nofile=262144:262144 -v D:/ch/latest/external:/external:rw -v chlatest:/var/lib/clickhouse:rw -v D:/ch/latest/logs:/var/log/clickhouse-server:rw -v D:/ch/latest/etc/clickhouse-server:/etc/clickhouse-server:rw clickhouse/clickhouse-server:24.3.2.23
docker exec -it ch bash
clickhouse-client --format_csv_delimiter=','

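transform function

Performs dictionary-style value replacement: each value found in array_from is mapped to the element at the same position in array_to.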

transform(x, array_from, array_to, default)
transform(T, Array(T), Array(U), U) -> U
transform(x, array_from, array_to)

Example dataset: UK-house-price-dataset.csv

CREATE TABLE uk_price_paid
(
    price UInt32,
    date Date,
    postcode1 LowCardinality(String),
    postcode2 LowCardinality(String),
    type Enum8('terraced' = 1, 'semi-detached' = 2, 'detached' = 3, 'flat' = 4, 'other' = 0),
    is_new UInt8,
    duration Enum8('freehold' = 1, 'leasehold' = 2, 'unknown' = 0),
    addr1 String,
    addr2 String,
    street LowCardinality(String),
    locality LowCardinality(String),
    town LowCardinality(String),
    district LowCardinality(String),
    county LowCardinality(String)
)
ENGINE = MergeTree
ORDER BY (postcode1, postcode2, addr1, addr2);

INSERT INTO uk_price_paid
WITH splitByChar(' ', postcode) AS p
SELECT
    toUInt32(price_string) AS price,
    parseDateTimeBestEffortUS(time) AS date,
    p[1] AS postcode1,
    p[2] AS postcode2,
    transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
    b = 'Y' AS is_new,
    transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
    addr1, addr2, street, locality, town, district, county
FROM file(
    'UK-house-price-dataset.csv',
    'CSV',
    'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
);

SELECT transform(number, [0, 1, 2, 3, 4, 5, 6, 7, 8, 9], ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine'], NULL) AS numbers
FROM system.numbers
LIMIT 10;
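The lookup behaviour of transform can be modelled outside the database. The following Python function is only a sketch of the semantics described above, not ClickHouse's implementation: it assumes that the first matching entry wins, that the four-argument form returns the default on a miss, and that the three-argument form returns the input unchanged.

```python
_MISSING = object()  # sentinel: distinguishes "no default given" from default=None


def transform(x, array_from, array_to, default=_MISSING):
    """Return the element of array_to paired with the first match of x in array_from."""
    if len(array_from) != len(array_to):
        raise ValueError("array_from and array_to must have the same length")
    for src, dst in zip(array_from, array_to):
        if x == src:
            return dst
    # No match: fall back to the default if one was given, else to x itself.
    return x if default is _MISSING else default


# Same mapping as the `type` column in the INSERT above.
codes = ['T', 'S', 'D', 'F', 'O']
names = ['terraced', 'semi-detached', 'detached', 'flat', 'other']
print(transform('D', codes, names))        # detached
print(transform('X', codes, names))        # X
print(transform('X', codes, names, None))  # None
```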

Reading files

ClickHouse can automatically detect the file format and infer column types.

SELECT * FROM (
    WITH splitByChar(' ', postcode) AS p
    SELECT
        toUInt32(price_string) AS price,
        parseDateTimeBestEffortUS(time) AS date,
        p[1] AS postcode1,
        p[2] AS postcode2,
        transform(a, ['T', 'S', 'D', 'F', 'O'], ['terraced', 'semi-detached', 'detached', 'flat', 'other']) AS type,
        b = 'Y' AS is_new,
        transform(c, ['F', 'L', 'U'], ['freehold', 'leasehold', 'unknown']) AS duration,
        addr1, addr2, street, locality, town, district, county
    FROM file(
        'UK-house-price-dataset.csv',
        'CSV',
        'uuid_string String, price_string String, time String, postcode String, a String, b String, c String, addr1 String, addr2 String, street String, locality String, town String, district String, county String, d String, e String'
    ) SETTINGS format_csv_delimiter=','
) LIMIT 2;

User-defined functions

Write your own functions as needed.

CREATE OR REPLACE TABLE line_changes
(
    version UInt32,
    line_change_type Enum('Add' = 1, 'Delete' = 2, 'Modify' = 3),
    line_number UInt32,
    line_content String,
    time datetime default now()
)
ENGINE = MergeTree
ORDER BY time;

INSERT INTO default.line_changes (version,line_change_type,line_number,line_content) VALUES
(1, 'Add'   , 1, 'ClickHouse provides SQL'),
(2, 'Add'   , 2, 'with improvements'),
(3, 'Add'   , 3, 'that makes it more friendly for analytical tasks.'),
(4, 'Add'   , 2, 'with many extensions'),
(5, 'Modify', 3, 'and powerful improvements'),
(6, 'Delete', 1, ''),
(7, 'Add'   , 1, 'ClickHouse provides a superset of SQL');

-- add a string (str) into an array (arr) at a specific position (pos)
CREATE OR REPLACE FUNCTION add AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos));

-- delete the element at a specific position (pos) from an array (arr)
CREATE OR REPLACE FUNCTION delete AS (arr, pos) -> arrayConcat(arraySlice(arr, 1, pos-1), arraySlice(arr, pos+1));

-- replace the element at a specific position (pos) in an array (arr)
CREATE OR REPLACE FUNCTION modify AS (arr, pos, str) -> arrayConcat(arraySlice(arr, 1, pos-1), [str], arraySlice(arr, pos+1));

arrayFold

SELECT arrayFold((acc, v) -> (acc + v), [10, 20, 30], 0::UInt64) AS sum;

CREATE OR REPLACE VIEW text_version AS
WITH T1 AS (
    SELECT arrayZip(
        groupArray(line_change_type),
        groupArray(line_number),
        groupArray(line_content)) AS line_ops
    FROM (SELECT * FROM line_changes WHERE version <= {version:UInt32} ORDER BY version ASC)
)
SELECT arrayJoin(
    arrayFold((acc, v) ->
        if(v.'change_type' = 'Add',    add(acc, v.'line_nr', v.'content'),
        if(v.'change_type' = 'Delete', delete(acc, v.'line_nr'),
        if(v.'change_type' = 'Modify', modify(acc, v.'line_nr', v.'content'), []))),
        line_ops::Array(Tuple(change_type String, line_nr UInt32, content String)),
        []::Array(String))) AS lines
FROM T1;

SELECT * FROM text_version(version = 3);
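To make the fold easier to follow, here is a rough Python equivalent that replays the same line changes with functools.reduce. It is a sketch only: the helper names mirror the SQL functions add, delete and modify, positions are 1-based as in the SQL, and the text_version function is a hypothetical stand-in for the parameterized view.

```python
from functools import reduce

# (version, change_type, line_number, content), same rows as the INSERT above
LINE_CHANGES = [
    (1, 'Add', 1, 'ClickHouse provides SQL'),
    (2, 'Add', 2, 'with improvements'),
    (3, 'Add', 3, 'that makes it more friendly for analytical tasks.'),
    (4, 'Add', 2, 'with many extensions'),
    (5, 'Modify', 3, 'and powerful improvements'),
    (6, 'Delete', 1, ''),
    (7, 'Add', 1, 'ClickHouse provides a superset of SQL'),
]


def add(arr, pos, s):
    """Insert s so that it ends up at 1-based position pos."""
    return arr[:pos - 1] + [s] + arr[pos - 1:]


def delete(arr, pos):
    """Remove the element at 1-based position pos."""
    return arr[:pos - 1] + arr[pos:]


def modify(arr, pos, s):
    """Replace the element at 1-based position pos with s."""
    return arr[:pos - 1] + [s] + arr[pos:]


def apply_change(acc, change):
    """One fold step: apply a single change to the accumulated list of lines."""
    _, kind, pos, content = change
    if kind == 'Add':
        return add(acc, pos, content)
    if kind == 'Delete':
        return delete(acc, pos)
    if kind == 'Modify':
        return modify(acc, pos, content)
    return []


def text_version(version):
    """Rebuild the text as of `version` by folding over all earlier changes."""
    ops = sorted(c for c in LINE_CHANGES if c[0] <= version)
    return reduce(apply_change, ops, [])


print(reduce(lambda acc, v: acc + v, [10, 20, 30], 0))  # 60, like the arrayFold sum
for line in text_version(7):
    print(line)
```

The accumulator plays the same role as acc in arrayFold: it starts as an empty list and each change produces the next version of the text.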

Parallel window functions

Window functions are now computed in parallel, which improves performance substantially.

SELECT
    country,
    day,
    max(tempAvg) AS temperature,
    avg(temperature) OVER (PARTITION BY country ORDER BY day ASC ROWS BETWEEN 5 PRECEDING AND CURRENT ROW) AS moving_avg_temp
FROM noaa
WHERE country != ''
GROUP BY
    country,
    date AS day
ORDER BY
    country ASC,
    day ASC
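The window in this query (the current row plus the five preceding rows, per country) can be sketched in Python as follows. Each partition is computed independently of the others, which is the property that makes parallel evaluation possible. The thread pool below only illustrates that idea and says nothing about how ClickHouse schedules the work internally; the function names and sample rows are made up.

```python
from collections import defaultdict, deque
from concurrent.futures import ThreadPoolExecutor


def moving_avg(values, preceding=5):
    """Average over the current value and up to `preceding` earlier values."""
    window = deque(maxlen=preceding + 1)  # old values fall out automatically
    result = []
    for v in values:
        window.append(v)
        result.append(sum(window) / len(window))
    return result


def moving_avg_by_partition(rows, preceding=5):
    """rows: iterable of (country, day, temperature) tuples."""
    partitions = defaultdict(list)
    for country, day, temp in sorted(rows):  # ORDER BY country, day
        partitions[country].append((day, temp))

    def work(item):
        country, series = item
        avgs = moving_avg([t for _, t in series], preceding)
        return [(country, day, avg) for (day, _), avg in zip(series, avgs)]

    # Partitions do not depend on each other, so they can run concurrently.
    with ThreadPoolExecutor() as pool:
        parts = pool.map(work, sorted(partitions.items()))
    return [row for part in parts for row in part]


sample = [("UK", 1, 10.0), ("UK", 2, 20.0), ("FR", 1, 5.0)]
for row in moving_avg_by_partition(sample):
    print(row)
```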

FINAL

With FINAL and the enable_vertical_final setting, queries against tables that use the ReplacingMergeTree and AggregatingMergeTree engines can quickly return the latest data. The setting only affects queries that use FINAL, so both queries below include it.

SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3;

SELECT
    postcode1,
    formatReadableQuantity(avg(price))
FROM uk_property_offers FINAL
GROUP BY postcode1
ORDER BY avg(price) DESC
LIMIT 3
SETTINGS enable_vertical_final = 1;
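What FINAL changes in the result can be illustrated with a small Python model. This is a sketch under stated assumptions, not ClickHouse's merge algorithm: it assumes a ReplacingMergeTree table without a version column, where the most recently inserted row for each sorting key wins. The `id` key and the sample rows are hypothetical.

```python
def read_final(rows, key):
    """Keep only the last inserted row for each key (rows are in insertion order)."""
    latest = {}
    for row in rows:
        latest[key(row)] = row  # a later row replaces an earlier one with the same key
    return list(latest.values())


def avg_price(rows):
    return sum(r["price"] for r in rows) / len(rows)


# Offer 1 was inserted twice; the second insert is an updated price.
offers = [
    {"id": 1, "price": 100},
    {"id": 2, "price": 200},
    {"id": 1, "price": 150},
]

print(avg_price(offers))                                   # 150.0, counts the stale row
print(avg_price(read_final(offers, lambda r: r["id"])))    # 175.0, latest rows only
```

Without FINAL, the aggregate can include rows that have not been merged away yet, which is why the two averages differ.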

Variant Type

SET allow_experimental_variant_type=1, use_variant_as_common_type = 1;

SELECT
    map('Hello', 1, 'World', 'Mark') AS x,
    toTypeName(x) AS type
FORMAT Vertical;

Row 1:
──────
x:    {'Hello':1,'World':'Mark'}
type: Map(String, Variant(String, UInt8))

SELECT
    arrayJoin([1, true, 3.4, 'Mark']) AS value,
    toTypeName(value);

   ┌─value─┬─toTypeName(value)─────────────────────┐
1. │ true  │ Variant(Bool, Float64, String, UInt8) │
2. │ true  │ Variant(Bool, Float64, String, UInt8) │
3. │ 3.4   │ Variant(Bool, Float64, String, UInt8) │
4. │ Mark  │ Variant(Bool, Float64, String, UInt8) │
   └───────┴───────────────────────────────────────┘

String similarity functions

  • byteHammingDistance: the Hamming distance between two strings or vectors of equal length is the number of positions at which the corresponding symbols differ. In other words, it is the minimum number of substitutions required to change one string into the other. It is one of several string metrics for measuring the edit distance between two sequences, and it is named after the American mathematician Richard Hamming. For example, the Hamming distance between:

    • “karolin” and “kathrin” is 3.
    • “karolin” and “kerstin” is 3.
    • “kathrin” and “kerstin” is 4.
    • 0000 and 1111 is 4.
    • 2173896 and 2233796 is 3.

  • editDistance: a way of quantifying how dissimilar two strings (e.g., words) are to one another, measured by counting the minimum number of operations required to transform one string into the other.

  • damerauLevenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Damerau–Levenshtein distance between two words is the minimum number of operations (insertions, deletions or substitutions of a single character, or transposition of two adjacent characters) required to change one word into the other.

  • jaroWinklerSimilarity: a string metric for measuring the similarity between two sequences. It is a variant of the Jaro similarity metric that gives more weight to strings sharing a common prefix.

  • levenshteinDistance: a string metric for measuring the edit distance between two sequences. Informally, the Levenshtein distance between two words is the minimum number of single-character edits (insertions, deletions or substitutions) required to change one word into the other. Unlike Damerau–Levenshtein, a transposition of two adjacent characters counts as two edits.
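The definitions above can be checked with a few lines of Python. These are textbook implementations meant only to illustrate the metrics; they are not ClickHouse's code. They compare Unicode code points rather than bytes, and the Damerau–Levenshtein function is the restricted "optimal string alignment" variant, which is an assumption and may differ from what damerauLevenshteinDistance computes in edge cases.

```python
def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs sequences of equal length")
    return sum(x != y for x, y in zip(a, b))


def levenshtein(a, b):
    """Minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # delete ca
                           cur[j - 1] + 1,              # insert cb
                           prev[j - 1] + (ca != cb)))   # substitute or match
        prev = cur
    return prev[-1]


def damerau_levenshtein_osa(a, b):
    """Levenshtein plus transposition of adjacent characters (OSA variant)."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i
    for j in range(cols):
        d[0][j] = j
    for i in range(1, rows):
        for j in range(1, cols):
            cost = a[i - 1] != b[j - 1]
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i - 1][j - 1] + cost)
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # swap neighbours
    return d[-1][-1]


print(hamming("karolin", "kathrin"))        # 3
print(levenshtein("kitten", "sitting"))     # 3
print(levenshtein("ca", "ac"))              # 2: two substitutions
print(damerau_levenshtein_osa("ca", "ac"))  # 1: one transposition
```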

https://clickhouse.com/docs/en/sql-reference/functions/string-functions#dameraulevenshteindistance

CREATE TABLE domains
(
    `domain` String,
    `rank` Float64
)
ENGINE = MergeTree
ORDER BY domain;

INSERT INTO domains
SELECT
    c2 AS domain,
    1 / c1 AS rank
FROM url('domains.csv', 'CSV');

SELECT
    domain,
    levenshteinDistance(domain, 'facebook.com') AS d1,
    damerauLevenshteinDistance(domain, 'facebook.com') AS d2,
    jaroSimilarity(domain, 'facebook.com') AS d3,
    jaroWinklerSimilarity(domain, 'facebook.com') AS d4
FROM domains
ORDER BY d1 ASC
LIMIT 10

Query id: 6f499f27-8274-4787-819a-b510322bdce3

    ┌─domain────────┬─d1─┬─d2─┬─────────────────d3─┬─────────────────d4─┐
 1. │ facebook.com  │  0 │  0 │                  1 │                  1 │
 2. │ facebonk.com  │  1 │  1 │ 0.8838383838383838 │ 0.9303030303030303 │
 3. │ fabebook.com  │  1 │  1 │  0.914141414141414 │ 0.9313131313131312 │
 4. │ facabook.com  │  1 │  1 │ 0.9444444444444443 │  0.961111111111111 │
 5. │ facobook.com  │  1 │  1 │ 0.8535353535353535 │ 0.8974747474747474 │
 6. │ facebook1.com │  1 │  1 │ 0.9743589743589745 │ 0.9846153846153847 │
 7. │ faceook.com   │  1 │  1 │ 0.9722222222222221 │ 0.9833333333333333 │
 8. │ faacebook.com │  1 │  1 │ 0.9743589743589745 │ 0.9794871794871796 │
 9. │ faceboock.com │  1 │  1 │ 0.9326923076923077 │ 0.9596153846153846 │
10. │ facebool.com  │  1 │  1 │ 0.9444444444444443 │ 0.9666666666666666 │
    └───────────────┴────┴────┴────────────────────┴────────────────────┘

Vectorized distance functions

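ClickHouse can be used as a vector database. It supports three vector similarity measures: L2 (Euclidean) distance, cosine distance (cosineDistance), and inner product (IP).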

https://clickhouse.com/blog/clickhouse-release-24-02

WITH
    'dog' AS search_term,
    (SELECT vector FROM glove WHERE word = search_term LIMIT 1) AS target_vector
SELECT word, cosineDistance(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;

WITH
    'dog' AS search_term,
    (SELECT vector FROM glove WHERE word = search_term LIMIT 1) AS target_vector
SELECT
    word,
    1 - dotProduct(vector, target_vector) AS score
FROM glove
WHERE lower(word) != lower(search_term)
ORDER BY score ASC
LIMIT 5;
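For reference, the three measures can be written out in plain Python. This is a minimal sketch with made-up two-dimensional embeddings standing in for the glove table; the function names are illustrative and are not ClickHouse functions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def l2_distance(a, b):
    """Euclidean distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def cosine_distance(a, b):
    """1 - cosine similarity; 0 means the vectors point the same way."""
    return 1 - dot(a, b) / (math.sqrt(dot(a, a)) * math.sqrt(dot(b, b)))


def nearest(term, embeddings, k=5, distance=cosine_distance):
    """Return the k words closest to `term`, excluding the term itself."""
    target = embeddings[term]
    others = [w for w in embeddings if w.lower() != term.lower()]
    return sorted(others, key=lambda w: distance(embeddings[w], target))[:k]


# Hypothetical toy embeddings, for illustration only.
toy = {"dog": [1.0, 0.0], "puppy": [0.9, 0.1], "car": [0.0, 1.0]}
print(nearest("dog", toy, k=2))
```

Note that 1 - dotProduct gives the same ranking as cosineDistance only when the stored vectors are normalized to unit length; otherwise vector magnitude also influences the score.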

Adaptive asynchronous inserts

Asynchronous inserts shift data batching from the client side to the server side: data from insert queries is first placed in a buffer and written to database storage later, that is, asynchronously.
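The buffering idea can be sketched in Python. The class below is a toy model, not ClickHouse's implementation: the class name, the max_rows and max_wait parameters, and the tick method are all hypothetical, and the adaptive tuning of the flush timeout is not modelled. It only shows small inserts being collected and then written as one batch, either when the buffer is full or when the oldest buffered row has waited long enough.

```python
import time


class AsyncInsertBuffer:
    """Collects small inserts and hands them to `flush` as one batch."""

    def __init__(self, flush, max_rows=3, max_wait=1.0, clock=time.monotonic):
        self.flush = flush          # callable that writes one batch to storage
        self.max_rows = max_rows    # flush when this many rows are buffered
        self.max_wait = max_wait    # or when the oldest row is this old (seconds)
        self.clock = clock          # injectable so the sketch is easy to test
        self._rows = []
        self._first_ts = None

    def insert(self, rows):
        """Called for every client INSERT; returns before data hits storage."""
        if not self._rows:
            self._first_ts = self.clock()
        self._rows.extend(rows)
        self._maybe_flush()

    def tick(self):
        """Called periodically by the server to enforce the time limit."""
        self._maybe_flush()

    def _maybe_flush(self):
        if not self._rows:
            return
        full = len(self._rows) >= self.max_rows
        stale = self.clock() - self._first_ts >= self.max_wait
        if full or stale:
            batch, self._rows = self._rows, []
            self.flush(batch)


batches = []
buf = AsyncInsertBuffer(batches.append, max_rows=3)
for value in (1, 2, 3):
    buf.insert([value])   # three single-row inserts
print(batches)            # [[1, 2, 3]]: written as a single batch
```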