RADOS object with short name
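In the previous post we introduced ghobject_t, the data structure that describes an object, the file name under which an object is stored in the underlying file system, and how to map that file name back to a ghobject_t.
The mapping is shown in the figure below:
There is a gap in that description: the length of the object name. What happens if the object name is longer than the maximum file name length the local file system supports?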
cephfs
For cephfs, object names look like this:
root@185node:/var/share/ezfs/shareroot/bean_nas# dd if=/dev/zero of=bean bs=1M count=8
8+0 records in
8+0 records out
8388608 bytes (8.4 MB) copied, 0.00768172 s, 1.1 GB/s
root@185node:/var/share/ezfs/shareroot/bean_nas# cephfs bean map
WARNING: This tool is deprecated. Use the layout.* xattrs to query and modify layouts.
FILE OFFSET    OBJECT                  OFFSET    LENGTH     OSD
0              10000000022.00000000    0         4194304    0
4194304        10000000022.00000001    0         4194304    1
The name of an object that belongs to a cephfs file has two parts: the inode and the object index within the file. 10000000022 is the file's inode number in hexadecimal, and the number after the dot says which object of the file this is.
root@185node:/var/share/ezfs/shareroot/bean_nas# ll -li
total 8192
1099511627809 drwxrwxrwx 1 root root 1 May 29 11:39 ./
1099511627776 drwxrwxrwx 1 root root 2 May 29 11:01 ../
1099511627810 -rw-r--r-- 1 root root 8388608 May 29 11:39 bean
root@185node:/var/share/ezfs/shareroot/bean_nas# printf "%x\n" 1099511627810
10000000022
root@185node:/var/share/ezfs/shareroot/bean_nas#
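The naming rule can be written down as a small sketch. The function name `cephfs_object_name` is hypothetical, and the format string (inode in hex, a dot, the object index as 8 hex digits) is inferred from the listing above rather than taken from the Ceph source:

```cpp
#include <cinttypes>
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical helper: build "<inode in hex>.<object index as 8 hex digits>".
// The format is an assumption read off the `cephfs bean map` output above.
std::string cephfs_object_name(std::uint64_t ino, std::uint64_t index) {
    char buf[64];
    std::snprintf(buf, sizeof(buf), "%" PRIx64 ".%08" PRIx64, ino, index);
    return std::string(buf);
}
```

Called with the inode 1099511627810 reported by `ll -li` and index 1, it yields the second object name from the map output.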
In this case the object name is regular and bounded in length. Let's look at the file that backs the object in the underlying file system:
Note: bean is the name of the pool, here and below.
root@185node:/var/log# ceph osd map bean 10000000022.00000001
osdmap e44 pool 'bean' (15) object '10000000022.00000001' -> pg 15.b5ce59c5 (15.1c5) -> up ([1], p1) acting ([1], p1)
root@185node:/data/osd.1/current/15.1c5_head# ll
total 4240
drwxr-xr-x 2 root root 4096 May 29 11:39 ./
drwxr-xr-x 3983 root root 135168 May 29 10:18 ../
-rw-r--r-- 1 root root 4194304 May 29 11:39 10000000022.00000001__head_B5CE59C5__f
-rw-r--r-- 1 root root 0 May 29 10:18 __head_000001C5__f
As expected, the object's file name in the underlying file system begins with the object name: 10000000022.00000001
RBD
RBD is similar. Let's create an RBD image:
root@185node:/# rbd create -p bean --image-format=2 --size 100 rbd_test
root@185node:/# rbd -p bean ls
rbd_test
root@185node:/# rbd -p bean info rbd_test
rbd image 'rbd_test':
    size 102400 kB in 25 objects
    order 22 (4096 kB objects)
    used objects: 0
    block_name_prefix: rbd_data.6n2q5cs0j0o53
    format: 2
    features: layering
root@185node:/#
root@185node:/# rbd -p bean map rbd_test
/dev/rbd0
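Note block_name_prefix: it is the prefix of the objects inside the RBD image. In the bean pool we can find the object name rbd_data.6n2q5cs0j0o53.0000000000000000. Overall, these object names are also regular and bounded in length.
Going down to the underlying file system and locating the object's local file, the file name again starts with the object name, with the underscore escaped as \u: rbd\udata.6n2q5cs0j0o53.0000000000000000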
root@185node:/# ceph osd map bean rbd_data.6n2q5cs0j0o53.0000000000000000
osdmap e44 pool 'bean' (15) object 'rbd_data.6n2q5cs0j0o53.0000000000000000' -> pg 15.715f761a (15.21a) -> up ([1], p1) acting ([1], p1)
root@185node:/data/osd.1/current/15.21a_head# ll
total 148
drwxr-xr-x 2 root root 4096 May 29 11:57 ./
drwxr-xr-x 3983 root root 135168 May 29 10:18 ../
-rw-r--r-- 1 root root 0 May 29 10:18 __head_0000021A__f
-rw-r--r-- 1 root root 513 May 29 11:57 rbd\udata.6n2q5cs0j0o53.0000000000000000__head_715F761A__f
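Both listings follow the same pattern: the escaped object name, then `__head_`, the 32-bit hash in upper-case hex, and finally `__` plus what looks like the pool id in hex (pool 15 becomes `f`). Here is a simplified sketch of that composition. The function name `on_disk_name` is hypothetical, and the sketch assumes an empty key, an empty namespace, the head snapshot, and that only `_` needs escaping; the real code handles more cases.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Hypothetical, simplified composition of the file name seen in the listings.
// Assumptions: empty key and namespace, "head" snapshot, only '_' is escaped.
std::string on_disk_name(const std::string& object_name,
                         std::uint32_t hash, std::uint64_t pool) {
    std::string out;
    for (char c : object_name) {
        if (c == '_')
            out += "\\u";   // '_' is the field separator, so it is escaped
        else
            out += c;
    }
    char tail[64];
    std::snprintf(tail, sizeof(tail), "__head_%08X__%llx",
                  static_cast<unsigned>(hash),
                  static_cast<unsigned long long>(pool));
    return out + tail;
}
```

Feeding it the object name, the PG hash printed by `ceph osd map`, and pool id 15 reproduces the two file names shown in the directory listings.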
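However, RADOS objects are the foundation of both cephfs and RBD, and RADOS itself supports fairly long object names. As the following commit says, RADOS supports object names up to 2048 bytes long.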
commit 7e0aca18a04a3848af77f5dd2093dc2e009386ec
Author: Sage Weil <sage@redhat.com>
Date: Wed Jul 16 14:17:27 2014 -0700

    osd: add config for osd_max_object_name_len = 2048 (was hard-coded at 4096)

    Previously we had a hard coded limit of 4096. Objects > 3k crash the OSD
    when running on ext4, although they probably work on xfs. But rgw only
    generates objects a bit over 1024 bytes (maybe 1200 tops?), so let set a
    more reasonable limit here. 2048 is a nice round number and should be
    safe.

    Add a test.

    Fixes: #8174
    Signed-off-by: Sage Weil <sage@redhat.com>

Unfortunately, local file systems are not that generous. The file name length they support is limited:
FS | max filename length in bytes |
---|---|
EXT4 | 255 |
XFS | 255 |
ZFS | 255 |
btrfs | 255 |
This inevitably causes a problem: the file name cannot even hold the object name, let alone extra fields such as the hash. How does Ceph solve this?
RADOS object with long name
Let's use the rados command to create an object with a very long object name.
First, generate a random name that is long enough:
root@185node:/# xxd -l $((2048/2)) -p /dev/urandom | tr -d '\n'
acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f89bf7e805256185b580f075008815cad96f79893599b3718d0dbc05796238c2cf22cd4ee0fadc3891951bbffb0602f3b14b3af7b1efe4c96a340de12fa3ba3f4baeb166768326cfe6d79ee210228266f292bdce01eb6d5c6eb4c64ac619d1aa3853d65a614e109638bf7e04389c8b9a06b41492e65a187abc834bfd6fc4988a55c9b2ed5b91a129acf572d6661fa1cac6ce4fb181b005883b38ca600e9004244fb6ff13cde1939c54583a3dc284cd82a6f77ee171a7b7423b040fc6a65070a6ff98a8b45fd3b1de8c325e6ec00c18d077ea6442b9b134fb9d515ea51427ef8dc43bb524c0a2e6958092186e1e3ae6058b114a5d7abfd7056e55596336f9191269731b71c240e1a449b4a83094fe5d5fe2143bcb19a0f913fb4a836f317a32cf74f91b1091b1c16644b39e0ec4dbfc6ec31f9a1da6c2e6c457e976e709b68c921f630fda53185ddc5c9454a63966b5982bc0905a84f134ee7e6187b9e2cd63b4a0fb174bf626c62400517cfb6121df951b3e0e895c1c2c1bd20dc73231f91e2d692c38d2f02f91158c824104c148d08c0ac2e363d7811d964a5fa6415a477e9ac2b304b51e66c52d7ec5d3214bd5f96044a0b96fe6e29a76b2e7818a41ff50db3ebc11eade7089e03237fcb913b17c5ff6de04278ffd7754c62951e493b4044ee916dce246898724a1306c6eae97a689dc9df3f69b42aae6071b00140a8a5d09e67b732c5f093eefc7ca719a7a6d3e5f53f9a36f8a4c9a9e28d19854559f911e1b42ef66ec1a5126ee2adb1d14dc10504a6c00063babea88c1c2b6e97581f771a099388a12d1050a6fe26cba538517195ed399053bd29467422064d8f6dd0661efa9e08f432c0f8ecf42bc589fa357547dc9313da0b172514d4aa102b8a6e01f0205e3c36db2102a7788924d6d314beff379c55d9dc433520355947f4da74038b4f263d74629cac1fa1248b4a89ced59a9005b667f3923b28bb80081429baf8a2748f3f84f31213b660046c22329cf1d3de4f2636be1257c0c8de15cc945f901db2243192802c92162fffef4eee3d4f5aeb9228291d6b89df6ef7c495f9041c65e386a8d77d3ba4b6bc19f0d049d07a49ca95deac3242d0ae8f643df4c65eae119f73516da42e17f8a06b9ea17e1bf248a50b57b870be2cf2269314534a17e77fc0266e05651169a0be11328371dd426d72cb51fa7e1ab5f75f55c0db9453824eeaaa1e156b5c0e0ba27e1f2f99b0733b2b6f004f8dd9f
41321b6c24d36ccda327cadc85d97132878c40bb03252cb0
Next, use the rados command to create an object with that name. Its content is "hello,world":
root@185node:/# rados --pool=bean put acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f89bf7e805256185b580f075008815cad96f79893599b3718d0dbc05796238c2cf22cd4ee0fadc3891951bbffb0602f3b14b3af7b1efe4c96a340de12fa3ba3f4baeb166768326cfe6d79ee210228266f292bdce01eb6d5c6eb4c64ac619d1aa3853d65a614e109638bf7e04389c8b9a06b41492e65a187abc834bfd6fc4988a55c9b2ed5b91a129acf572d6661fa1cac6ce4fb181b005883b38ca600e9004244fb6ff13cde1939c54583a3dc284cd82a6f77ee171a7b7423b040fc6a65070a6ff98a8b45fd3b1de8c325e6ec00c18d077ea6442b9b134fb9d515ea51427ef8dc43bb524c0a2e6958092186e1e3ae6058b114a5d7abfd7056e55596336f9191269731b71c240e1a449b4a83094fe5d5fe2143bcb19a0f913fb4a836f317a32cf74f91b1091b1c16644b39e0ec4dbfc6ec31f9a1da6c2e6c457e976e709b68c921f630fda53185ddc5c9454a63966b5982bc0905a84f134ee7e6187b9e2cd63b4a0fb174bf626c62400517cfb6121df951b3e0e895c1c2c1bd20dc73231f91e2d692c38d2f02f91158c824104c148d08c0ac2e363d7811d964a5fa6415a477e9ac2b304b51e66c52d7ec5d3214bd5f96044a0b96fe6e29a76b2e7818a41ff50db3ebc11eade7089e03237fcb913b17c5ff6de04278ffd7754c62951e493b4044ee916dce246898724a1306c6eae97a689dc9df3f69b42aae6071b00140a8a5d09e67b732c5f093eefc7ca719a7a6d3e5f53f9a36f8a4c9a9e28d19854559f911e1b42ef66ec1a5126ee2adb1d14dc10504a6c00063babea88c1c2b6e97581f771a099388a12d1050a6fe26cba538517195ed399053bd29467422064d8f6dd0661efa9e08f432c0f8ecf42bc589fa357547dc9313da0b172514d4aa102b8a6e01f0205e3c36db2102a7788924d6d314beff379c55d9dc433520355947f4da74038b4f263d74629cac1fa1248b4a89ced59a9005b667f3923b28bb80081429baf8a2748f3f84f31213b660046c22329cf1d3de4f2636be1257c0c8de15cc945f901db2243192802c92162fffef4eee3d4f5aeb9228291d6b89df6ef7c495f9041c65e386a8d77d3ba4b6bc19f0d049d07a49ca95deac3242d0ae8f643df4c65eae119f73516da42e17f8a06b9ea17e1bf248a50b57b870be2cf2269314534a17e77fc0266e05651169a0be11328371dd426d72cb51fa7e1ab5f75f55c0db9453824eeaaa1e1
56b5c0e0ba27e1f2f99b0733b2b6f004f8dd9f41321b6c24d36ccda327cadc85d97132878c40bb03252cb0 <(echo "hello,world")
Use the ceph osd map command to find the OSD that holds the object:
-> pg 15.5939415b (15.15b) -> up ([0], p0) acting ([0], p0)
Now look for the object's file in the local file system:
root@185node:/data/osd.0/current/15.15b_head# ll
total 152
drwxr-xr-x 2 root root 4096 May 29 10:38 ./
drwxr-xr-x 3961 root root 135168 May 29 10:18 ../
-rw-r--r-- 1 root root 12 May 29 10:38 acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f_8293d87c929eba91a280_0_long
-rw-r--r-- 1 root root 0 May 29 10:18 __head_0000015B__f
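Clearly the local file system cannot store a name that is 2048 characters long. So how does Ceph handle long object names?
Judging from the name stored in the local file system, the file name has four parts:
- object name prefix, of length FILENAME_PREFIX_LEN
- a hash of the name, computed with SHA-1 over the complete name rather than just the prefix; in the listing above this field is 20 hex characters (FILENAME_HASH_LEN)
- candidate index, the value of the argument passed when lfn_get_name is called
- FILENAME_COOKIE, a static string: the literal 'long'
The four parts are separated by underscores (_).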
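To make the layout concrete, here is a minimal sketch that splits such a file name back into its four parts by walking from the right, since only the last three underscores are separators. The struct `ShortNameParts` and the function `split_short_name` are hypothetical names for illustration, not part of Ceph:

```cpp
#include <cstddef>
#include <string>

// Hypothetical holder for the four underscore-separated parts.
struct ShortNameParts {
    std::string prefix;
    std::string hash;
    std::string index;
    std::string cookie;
    bool ok = false;   // true only if the name ends in "_long"
};

// Split "<prefix>_<hash>_<index>_long" from the right-hand side.
ShortNameParts split_short_name(const std::string& name) {
    ShortNameParts p;
    std::size_t c = name.rfind('_');              // separator before the cookie
    if (c == std::string::npos || c == 0) return p;
    std::size_t i = name.rfind('_', c - 1);       // separator before the index
    if (i == std::string::npos || i == 0) return p;
    std::size_t h = name.rfind('_', i - 1);       // separator before the hash
    if (h == std::string::npos) return p;
    p.prefix = name.substr(0, h);
    p.hash = name.substr(h + 1, i - h - 1);
    p.index = name.substr(i + 1, c - i - 1);
    p.cookie = name.substr(c + 1);
    p.ok = (p.cookie == "long");
    return p;
}
```

A short-name file such as 10000000022.00000001__head_B5CE59C5__f does not end in the cookie, so `ok` stays false for it.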
This logic is implemented in build_filename:
void LFNIndex::build_filename(const char *old_filename, int i, char *filename, int len)
{
  char hash[FILENAME_HASH_LEN + 1];

  assert(len >= FILENAME_SHORT_LEN + 4);

  strncpy(filename, old_filename, FILENAME_PREFIX_LEN);
  filename[FILENAME_PREFIX_LEN] = '\0';
  if ((int)strlen(filename) < FILENAME_PREFIX_LEN)
    return;
  if (old_filename[FILENAME_PREFIX_LEN] == '\0')
    return;

  hash_filename(old_filename, hash, sizeof(hash));
  int ofs = FILENAME_PREFIX_LEN;
  while (1) {
    int suffix_len = sprintf(filename + ofs, "_%s_%d_%s", hash, i, FILENAME_COOKIE.c_str());
    if (ofs + suffix_len <= FILENAME_SHORT_LEN || !ofs)
      break;
    ofs--;
  }
}
The logic is simple. If old_filename, the original name, is no longer than FILENAME_PREFIX_LEN, it is a short name: nothing more needs to be done and the name is copied into filename as-is. If old_filename is longer, the hash of the name is computed and the four-part name described above is assembled. The while loop shortens the prefix one character at a time until the whole result fits in FILENAME_SHORT_LEN.
#define CEPH_CRYPTO_SHA1_DIGESTSIZE 20

class LFNIndex : public CollectionIndex {
  /// Hash digest output size.
  static const int FILENAME_LFN_DIGEST_SIZE = CEPH_CRYPTO_SHA1_DIGESTSIZE;
  /// Length of filename hash.
  static const int FILENAME_HASH_LEN = FILENAME_LFN_DIGEST_SIZE;
  /// Max filename size.
  static const int FILENAME_MAX_LEN = 4096;
  /// Length of hashed filename.
  static const int FILENAME_SHORT_LEN = 255;
  /// Length of hashed filename prefix.
  static const int FILENAME_PREFIX_LEN;
  /// Length of hashed filename cookie.
  static const int FILENAME_EXTRA = 4;
  /// Lfn cookie value.
  static const string FILENAME_COOKIE;
  /// Name of LFN attribute for storing full name.
  static const string LFN_ATTR;
  /// Prefix for subdir index attributes.
  static const string PHASH_ATTR_PREFIX;
  /// Prefix for index subdirectories.
  static const string SUBDIR_PREFIX;
  // ... (rest of the class omitted)
};

// FILENAME_COOKIE is defined first because FILENAME_PREFIX_LEN uses its size.
const string LFNIndex::FILENAME_COOKIE = "long";
const int LFNIndex::FILENAME_PREFIX_LEN = FILENAME_SHORT_LEN - FILENAME_HASH_LEN -
                                          FILENAME_COOKIE.size() -
                                          FILENAME_EXTRA;
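Plugging in the numbers gives FILENAME_PREFIX_LEN = 255 - 20 - 4 - 4 = 227, which matches the listing: a 227-character prefix plus the 28-character suffix `_8293d87c929eba91a280_0_long` is exactly 255. The sketch below redoes that arithmetic outside of Ceph. The constant names and `make_short_name` are hypothetical, and the hash is passed in as a ready-made 20-character string rather than computed:

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

constexpr std::size_t kShortLen  = 255;  // mirrors FILENAME_SHORT_LEN
constexpr std::size_t kHashLen   = 20;   // mirrors FILENAME_HASH_LEN
constexpr std::size_t kCookieLen = 4;    // length of "long"
constexpr std::size_t kExtra     = 4;    // mirrors FILENAME_EXTRA
constexpr std::size_t kPrefixLen = kShortLen - kHashLen - kCookieLen - kExtra;
static_assert(kPrefixLen == 227, "255 - 20 - 4 - 4");

// Hypothetical stand-in for the long-name branch of build_filename.
// The caller supplies the hash text; no SHA-1 is computed here.
std::string make_short_name(const std::string& long_name,
                            const std::string& hash, int index) {
    const std::string suffix =
        "_" + hash + "_" + std::to_string(index) + "_long";
    // Same effect as the ofs-- loop: shrink the prefix until everything fits.
    const std::size_t room =
        suffix.size() < kShortLen ? kShortLen - suffix.size() : 0;
    const std::size_t keep = std::min(kPrefixLen, room);
    return long_name.substr(0, keep) + suffix;
}
```

With a single-digit index the prefix keeps all 227 characters; a two-digit index costs one prefix character so the total stays at 255.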
Sometimes the short file name has to be generated from a ghobject_t:
string LFNIndex::lfn_get_short_name(const ghobject_t &oid, int i)
{
  string long_name = lfn_generate_object_name(oid);
  assert(lfn_must_hash(long_name));
  char buf[FILENAME_SHORT_LEN + 4];
  build_filename(long_name.c_str(), i, buf, sizeof(buf));
  return string(buf);
}
Because the short file name is only a digest of the long name, information is necessarily lost, so we need a way to check whether a short file name matches a candidate long name:
bool LFNIndex::short_name_matches(const char *short_name, const char *cand_long_name)
{
  const char *end = short_name;
  while (*end) ++end;
  const char *suffix = end;
  if (suffix > short_name) --suffix;                       // last char
  while (suffix > short_name && *suffix != '_') --suffix;  // back to first _
  if (suffix > short_name) --suffix;                       // one behind that
  while (suffix > short_name && *suffix != '_') --suffix;  // back to second _

  int index = -1;
  char buf[FILENAME_SHORT_LEN + 4];
  assert((end - suffix) < (int)sizeof(buf));
  int r = sscanf(suffix, "_%d_%s", &index, buf);
  if (r < 2)
    return false;
  if (strcmp(buf, FILENAME_COOKIE.c_str()) != 0)
    return false;
  build_filename(cand_long_name, index, buf, sizeof(buf));
  return strcmp(short_name, buf) == 0;
}
As I said, SHA-1 is a digest. When a name is cut from 2K down to 200-odd bytes, information is lost even with the SHA-1 digest attached. How, then, can all of the object's information be recovered from the file on disk? The file name alone cannot do it: the loss is irreversible, so the full object information cannot be rebuilt from it.
Ceph uses xattrs. I have been meaning to write about Ceph's chain_xattr for a few days, but it always seemed too simple and the right moment never came. Let's cover the principle first; the xattr part itself is not complicated.
root@185node:/data/osd.0/current/15.15b_head# getfattr -d acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f_8293d87c929eba91a280_0_long
# file: acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f_8293d87c929eba91a280_0_long
user.ceph.snapset=0sAgIZAAAAAAAAAAAAAAABAAAAAAAAAAAAAAAAAAAAAA==
user.cephos.lfn3="acda7ad8b034a90f9b980be5ed47e242209061c2515c2021b83f1f8c49d018d621a14043a68be64ecec025a1434f040c853b7419c0c571c6b20a5e4a25fe7bf2ff181b60508622bf89f7818add55022ba17d6c9f8bd2938d97788964d0da8405a29d5fa77b07b6e4484b5335b20c9e6eb2f89bf7e805256185b580f075008815cad96f79893599b3718d0dbc05796238c2cf22cd4ee0fadc3891951bbffb0602f3b14b3af7b1efe4c96a340de12fa3ba3f4baeb166768326cfe6d79ee210228266f292bdce01eb6d5c6eb4c64ac619d1aa3853d65a614e109638bf7e04389c8b9a06b41492e65a187abc834bfd6fc4988a55c9b2ed5b91a129acf572d6661fa1cac6ce4fb181b005883b38ca600e9004244fb6ff13cde1939c54583a3dc284cd82a6f77ee171a7b7423b040fc6a65070a6ff98a8b45fd3b1de8c325e6ec00c18d077ea6442b9b134fb9d515ea51427ef8dc43bb524c0a2e6958092186e1e3ae6058b114a5d7abfd7056e55596336f9191269731b71c240e1a449b4a83094fe5d5fe2143bcb19a0f913fb4a836f317a32cf74f91b1091b1c16644b39e0ec4dbfc6ec31f9a1da6c2e6c457e976e709b68c921f630fda53185ddc5c9454a63966b5982bc0905a84f134ee7e6187b9e2cd63b4a0fb174bf626c62400517cfb6121df951b3e0e895c1c2c1bd20dc73231f91e2d692c38d2f02f91158c824104c148d08c0ac2e363d7811d964a5fa6415a477e9ac2b304b51e66c52d7ec5d3214bd5f96044a0b96fe6e29a76b2e7818a41ff50db3ebc11eade7089e03237fcb913b17c5ff6de04278ffd7754c62951e493b4044ee916dce246898724a1306c6eae97a689dc9df3f69b42aae6071b00140a8a5d09e67b732c5f093eefc7ca719a7a6d3e5f53f9a36f8a4c9a9e28d19854559f911e1b42ef66ec1a5126ee2adb1d14dc10504a6c00063babea88c1c2b6e97581f771a099388a12d1050a6fe26cba538517195ed399053bd29467422064d8f6dd0661efa9e08f432c0f8ecf42bc589fa357547dc9313da0b172514d4aa102b8a6e01f0205e3c36db2102a7788924d6d314beff379c55d9dc433520355947f4da74038b4f263d74629cac1fa1248b4a89ced59a9005b667f3923b28bb80081429baf8a2748f3f84f31213b660046c22329cf1d3de4f2636be1257c0c8de15cc945f901db2243192802c92162fffef4eee3d4f5aeb9228291d6b89df6ef7c495f9041c65e386a8d77d3ba4b6bc19f0d049d07a49ca95deac3242d0ae8f643df4c65eae119f73516da42e17f8a06b9ea17e1bf248a50b57b870be2cf2269314534a17e77fc0266e05651169a0be11328371dd426d72cb51fa7e1ab5f75f55c0db9453824eeaaa1e156b5c0e0ba27e1f2f99b
0733b2b6f004f8dd9f41321b6c24d36ccda327cadc85d97132878c40bb03252cb0"
user.cephos.lfn3@1="__head_5939415B__f"
user.cephos.spill_out=0sMQA=
root@185node:/data/osd.0/current/15.15b_head#
Note that the file with the short name carries these extended attributes:
- user.cephos.lfn3
- user.cephos.lfn3@1
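Ceph stores all the information it needs about the object in the extended attribute user.cephos.lfn$INDEX_VERSION. So why is there also a user.cephos.lfn3@1? That is what chain_xattr is about. The two Linux extended attributes together hold a single logical attribute. The value a single extended attribute can hold on the local file system (EXT4 here) is very limited, about 2K, so the value cannot be stored under one key and several keys are used to describe one attribute. That is the meaning of "chain" in chain_xattr. In the dump above, user.cephos.lfn3 holds the 2048-character object name, and user.cephos.lfn3@1 holds the rest of the full name, __head_5939415B__f.
In other words, if you want to store a key/value pair in the extended attributes of a file on a Linux file system and the value is longer than one extended attribute can hold, you have to store it like this: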
key key@1 key@2 key@3
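The chaining idea can be sketched in a few lines. This is not Ceph's chain_xattr code: `chain_split` is a hypothetical helper, the chunk size is a parameter chosen by the caller, and the sketch only computes the (key, value) pairs without calling setxattr:

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: cut `value` into pieces of at most `chunk` bytes and
// name them key, key@1, key@2, ... in order.
std::vector<std::pair<std::string, std::string>>
chain_split(const std::string& key, const std::string& value, std::size_t chunk) {
    std::vector<std::pair<std::string, std::string>> out;
    if (chunk == 0) return out;   // nothing sensible to do
    std::size_t part = 0;
    // An empty value still produces one (key, "") entry.
    for (std::size_t off = 0; off < value.size() || part == 0;
         off += chunk, ++part) {
        std::string name = (part == 0) ? key : key + "@" + std::to_string(part);
        out.emplace_back(name, value.substr(off, chunk));
    }
    return out;
}
```

With a chunk size of 2048, a 2048-character object name followed by `__head_5939415B__f` splits into two pieces, matching the user.cephos.lfn3 and user.cephos.lfn3@1 pair in the getfattr output. Reading the value back is the reverse: concatenate key, key@1, key@2 and so on until a key is missing.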
OK, that covers the idea. There is still some code left to walk through, but let's stop here for now. I'm tired too.