Exploring the NoSQL Family

Knowledge of NoSQL databases is increasingly required in data science applications, yet the taxonomy is so diverse and problem-centered that it can be a challenge to grasp. This post attempts to shed light on some of the concepts, delving into each design's specifics along the way.

We start by briefly introducing NoSQL and the reasoning behind its emergence, followed by an analysis of each of the four members of the NoSQL family: their behavior and main mechanisms, as well as their advantages, disadvantages, and typical use cases.

Figure: The NoSQL database family and their representations.

What is NoSQL?

NoSQL (Not-only SQL) came into prominence in the mid-to-late 2000s as an alternative to traditional SQL databases. Instigated by the Web 2.0 industry, it allows for horizontal scaling, distributed databases, and flexible models (schema-less design). This paradigm shift means developers can spend more of their time building features and less on database design. Typically, NoSQL solutions are presented as a cost-effective alternative to their SQL counterparts that relaxes the rigidity of an RDBMS.

Initially, NoSQL databases focused on key-value models, effectively removing the need for SQL; hence the name NoSQL, an abbreviation of "No SQL Support". Over time the community realized that each tool filled a specific need, abandoning the "death to SQL" sentiment in favor of a coexistence-driven approach, and the meaning of NoSQL changed to "Not-only SQL".

The accompanying figure represents the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two of three guarantees: Consistency, Availability, and Partition tolerance.

Whereas traditional RDBMSs focus on the left side of the diagram (Consistency and Availability, CA), NoSQL databases gain horizontal partitioning and speed by sacrificing Consistency in favor of Availability and Partition Tolerance (AP).

Although some do offer ACID compliance, most NoSQL databases revolve around what is known as "eventual consistency", abiding by the BASE properties (Basically Available, Soft state, Eventually consistent) instead. Eventual consistency stipulates that, eventually, all the nodes within the cluster will contain the same data version. In the meantime, multiple queries may temporarily return different results; these are referred to as stale reads and are a direct result of relaxing the Consistency guarantee.
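To make the idea of a stale read concrete, here is a minimal sketch of a toy cluster in Python. It is not how any particular database implements replication: the class names, the three-replica layout, and the explicit `sync()` step are all assumptions chosen for illustration.

```python
class Replica:
    """One node holding its own copy of the data."""

    def __init__(self):
        self.data = {}


class EventuallyConsistentStore:
    """Toy cluster (hypothetical): a write lands on one replica and is
    propagated to the others later."""

    def __init__(self, n_replicas=3):
        self.replicas = [Replica() for _ in range(n_replicas)]
        self.pending = []  # writes not yet applied on every replica

    def write(self, key, value):
        # Acknowledge the write as soon as the first replica has it.
        self.replicas[0].data[key] = value
        self.pending.append((key, value))

    def read(self, key, replica_index):
        # A read may hit any replica, including one that is lagging behind.
        return self.replicas[replica_index].data.get(key)

    def sync(self):
        # Stand-in for background replication: apply pending writes everywhere.
        for key, value in self.pending:
            for replica in self.replicas:
                replica.data[key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("cart:42", ["book"])
fresh = store.read("cart:42", replica_index=0)      # the replica that took the write
stale = store.read("cart:42", replica_index=2)      # a replica that has not caught up
store.sync()
converged = store.read("cart:42", replica_index=2)  # same answer everywhere now
```

Between `write` and `sync`, two reads of the same key return different results depending on which replica answers; after `sync`, every replica agrees. That window is what "eventual" refers to.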

So when should you use NoSQL? Simply put: when ACID compliance is not a requirement, when fast and cheap scalability is mandatory, or when your application falls into the Big Data category.

Before proceeding, keep in mind that NoSQL design is drastically different from that of its RDBMS counterparts. In NoSQL, data tends to be denormalized and often repeated, as storage is considered a cheap commodity compared with access speed and availability. This paradigm, combined with the lack of a consistency guarantee, is bound to generate temporary divergences between copies of the same data.

Note #1: Several authors consider that NoSQL has clearly failed to reach its goal, paving the way for the accommodation of NoSQL-like features in traditional SQL via NewSQL or Distributed SQL. But let us skip this topic for the time being.

Note #2: Some NoSQL databases allow for ACID-like transactions as long as they are performed within the same collection (or, in Azure Table Storage, within the same partition, where they are referred to as entity group transactions). MongoDB added support for ACID-like distributed transactions in version 4.2.

Note #3: Some SQL databases are able to combine horizontal sharding/scaling and distributed queries using, for instance, the Citus extension for PostgreSQL.

Key-Value Databases

The simplest flavor of NoSQL database revolves around the concept of associative arrays. In other words, it simply ties a given key to a record of any type, from a simple string or JSON document to a video file.

Key-value databases are organized into partitions or buckets, which can contain one or several entities, and each record is identified by a unique row key. Records in turn contain one or multiple fields. The concept of partition and row keys is one of the most critical aspects of NoSQL, as it allows logical partitions (defined by the partition key) to be moved across physical partitions (nodes) according to their workload. At the same time, it is one of the major hindrances for newcomers: selecting the wrong key will spell disaster if your queries cannot take advantage of it, and once you have selected it, there is no turning back.

The choice of partition key and row key is particularly challenging for key-value databases, given that you can only query a record via its key. What does this mean? In key-value terms, a JSON document can be pictured as a horizontal diagram in which the id field is the record's key, linked to the rest of the JSON's values.

As long as you know the partition and record keys, CRUD operations are blazing fast. But what if you want to retrieve all records where the "name" field equals "Michal"? Unlike in document databases, fields are not automatically indexed, which means that such a query requires going through every record to check whether it contains the "name" field and whether that field equals "Michal", analogous to a full table scan in SQL. I hope this clarifies how important row key selection is, and why it is important to know the system's purpose beforehand. Of course, these systems are not designed to be queried by field; we are merely stating their limitations as a general-purpose database.
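The difference between a key lookup and a field query can be sketched with a plain Python dictionary standing in for the store. The partition name, row keys, and record contents below are made up for illustration; only the "name equals Michal" query comes from the text.

```python
# Hypothetical store: (partition_key, row_key) -> record
store = {
    ("users", "u1"): {"name": "Michal", "city": "Lisbon"},
    ("users", "u2"): {"name": "Ana", "city": "Porto"},
    ("users", "u3"): {"name": "Michal"},   # records need not share fields
    ("users", "u4"): {"city": "Faro"},     # no "name" field at all
}


def get(partition_key, row_key):
    """Point lookup: a single hash lookup, independent of the store size."""
    return store.get((partition_key, row_key))


def scan_by_field(field, value):
    """Field query: every record must be inspected, like a full table scan."""
    return [key for key, record in store.items() if record.get(field) == value]


point = get("users", "u2")
matches = scan_by_field("name", "Michal")
```

`get` touches one entry regardless of how many records exist, while `scan_by_field` visits all of them, which is why the key should match the way the data will be read.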

A typical key is anything that can be considered unique: customerID, supplierID, sessionID, etc. Most developers opt for composite keys. For instance, UserID.session would store session data, while UserID.user would hold the user's cached information. This reduces each value to a single field instead of a dictionary-like structure, speeding up CRUD operations.
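A minimal sketch of the composite-key convention, assuming a dot as the separator and a dictionary as the store; `make_key`, `get_attribute`, and the sample values are hypothetical names introduced here.

```python
def make_key(user_id, attribute):
    """Build a composite key such as 'u42.session'."""
    return f"{user_id}.{attribute}"


cache = {}
cache[make_key("u42", "session")] = "token-abc"  # session data
cache[make_key("u42", "user")] = "Michal"        # cached user information


def get_attribute(user_id, attribute):
    # One point lookup per attribute; there is no nested value to unpack.
    return cache.get(make_key(user_id, attribute))


session = get_attribute("u42", "session")
missing = get_attribute("u42", "height")
```

Each attribute is read or overwritten on its own, at the cost of one key per attribute.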

How about partition keys? Earlier, we stated that partition keys allow the system to move partitions around according to their workload. Hence the obvious and most direct consequence of improper partition key selection is the creation of hot partitions: groups of entities that are frequently accessed but cannot be distributed, because they operate as a single unit (a single partition). With this in mind, a partition could be the UserID, whilst the keys would represent different user attributes: UserID.Name, UserID.Location, UserID.Height. If a physical partition contains several hot logical partitions (e.g. a set of users that request data frequently), the engine is able to redistribute the logical partitions, spreading the workload across different nodes. Without an appropriate partition key, there may be only a single logical partition, which is impossible to distribute.
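The effect of the partition key on load distribution can be sketched as follows. Real engines rebalance partitions based on observed workload; the hash-based placement, the four-node cluster, and the key names here are simplifying assumptions, not a description of any specific product.

```python
import hashlib
from collections import Counter


def node_for(partition_key, n_nodes):
    """Map a logical partition to a physical node (assumed hash placement)."""
    digest = hashlib.md5(partition_key.encode("utf-8")).hexdigest()
    return int(digest, 16) % n_nodes


def load_per_node(requested_partitions, n_nodes):
    """Count how many requests each node would serve."""
    return Counter(node_for(pk, n_nodes) for pk in requested_partitions)


# One logical partition per user: requests can be spread over the nodes.
per_user = [f"user-{i}" for i in range(1000)]
# A single partition key for everything: one node takes every request.
single_key = ["all-users"] * 1000

spread_load = load_per_node(per_user, n_nodes=4)
hot_load = load_per_node(single_key, n_nodes=4)
```

With a per-user key the 1000 requests are split over several nodes, whereas the single key produces one hot partition that no amount of extra hardware can relieve.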

Another important feature of key-value databases is that their records can have a time to live (TTL), a configurable automatic expiration date. This makes them strong candidates for session-driven storage.
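A TTL can be sketched in a few lines of Python. This is an illustrative toy, not the mechanism of any specific database: `TTLStore`, `FakeClock`, and the lazy eviction on read are assumptions, and the clock is injected only so that expiry can be demonstrated without waiting.

```python
import time


class TTLStore:
    """Toy key-value store whose records expire after ttl_seconds."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expires_at)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        item = self._data.get(key)
        if item is None:
            return None
        value, expires_at = item
        if self._clock() >= expires_at:
            del self._data[key]  # lazily evict the expired record
            return None
        return value


class FakeClock:
    """Manually advanced clock, used here instead of sleeping."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now


clock = FakeClock()
sessions = TTLStore(clock=clock)
sessions.set("session:abc", {"user": "u1"}, ttl_seconds=30)
before = sessions.get("session:abc")  # still within the 30-second window
clock.now = 31.0                      # pretend 31 seconds have passed
after = sessions.get("session:abc")   # the record has expired
```

For a web session this means no cleanup job is needed: an abandoned session simply disappears once its TTL has elapsed.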

Advantages

Scalability
Very fast read speed
Simple and flexible data model

Disadvantages

No relationship between entities
No transaction-like behavior
Only per-key queries are supported
No support for retrieving multiple keys at once
Slow multiple updates and collection scans

Vendors

Redis
Riak
Memcached
Azure Table Storage

Use cases

Web session cache
Store user preferences

Document Databases

Document store databases build on the concept behind key-value stores, extending it to support complex multi-layered objects named documents. This seemingly simple difference has several consequences. Because the engine is aware of the multi-level (nested) structure, all fields, including nested fields, are indexed, allowing for queries and selection on specific fields. Queries and joins work within a document, but not between documents. Denormalized design patterns allow for direct data retrieval without requiring joins. Some document store databases implement ACID-like properties.

Each document can be thought of as a row in a relational model, except that its schema is not predefined, nor do all documents need to be of the same object type.

Like their key-value predecessors, document databases are structured into collections, which in turn contain partitions and their nested entities. The same caution is mandatory when selecting partition and document (not row) keys. However, as stated above, the values in a document database are automatically indexed, allowing them to be queried. Nonetheless, a query on fields remains considerably less efficient than a lookup by key.
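The automatic indexing of nested fields can be sketched as follows. Real engines use far more elaborate index structures; the dotted-path flattening, the `DocumentCollection` class, and the sample documents are assumptions for illustration, and the sketch only handles scalar (hashable) field values.

```python
from collections import defaultdict


def flatten(doc, prefix=""):
    """Yield (dotted_path, value) pairs for every leaf field of a nested dict."""
    for field, value in doc.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):
            yield from flatten(value, prefix=path + ".")
        else:
            yield path, value


class DocumentCollection:
    """Toy collection that indexes every field, nested ones included, on insert."""

    def __init__(self):
        self.docs = {}                 # document key -> document
        self.index = defaultdict(set)  # (path, value) -> document keys

    def insert(self, doc_id, doc):
        self.docs[doc_id] = doc
        for path, value in flatten(doc):
            self.index[(path, value)].add(doc_id)

    def find(self, path, value):
        # Index lookup instead of scanning every document.
        return sorted(self.index.get((path, value), ()))


people = DocumentCollection()
people.insert("d1", {"name": "Michal", "address": {"city": "Lisbon"}})
people.insert("d2", {"name": "Ana", "address": {"city": "Lisbon"}, "age": 30})

in_lisbon = people.find("address.city", "Lisbon")
named_michal = people.find("name", "Michal")
```

The two documents have different shapes, yet both are reachable through the nested `address.city` field without any schema declaration. The price is paid at write time, since every insert also updates the index.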

Note that, due to this document-oriented structure and unlike in key-value stores, there is no strong need to create composite keys in document store databases, since per-field access is built into the engine.

The following diagram showcases the performance obtained per query type in Azure Cosmos DB, although the same ordering extends to other document store databases. The diagram flows from the most efficient, the point query, to the least efficient, the table scan.

At the start of this primer we stated that NoSQL databases sit on the AP side of the CAP theorem. In practice, due to continuous development, some NoSQL databases can switch between CAP guarantees, allowing the developer to create tailored solutions.

Advantages

Scalability for complex objects
Document-oriented data model (JSON or XML) allows for complex and schema-less structures
Supports queries and joins within the document
Data modeling paradigm allows storing all the data in a single document
Fast reads and writes

Disadvantages

Data modeling paradigm leads to data being duplicated among documents
Complex design leads to inconsistency

Vendors

MongoDB
Azure DocumentDB
AWS DynamoDB
OrientDB
CouchDB

Use cases:

Social networks
eCommerce
Anything in which you can relax ACID compliance

Column Family Databases

Column family databases share concepts with both RDBMSs and key-value stores. You can think of the rows as keys in a key-value store, and the columns as the values. Their optimal use case is large-scale data ingestion or data analytics, as they are suitable for storing billions of rows and tens of thousands of columns.

The column-oriented data model means these databases can efficiently handle sparse matrices, a major hindrance for a traditional RDBMS. In an RDBMS every column must be filled for every row (recall that NULL is a value, and the space it occupies corresponds to the column's type), which consumes storage. A column family database, in contrast, only stores the values that actually exist, per column, for each row (or key).

In the previous figure, notice how the keys "John Lennon" and "Paul McCartney" do not have a value set for the founded column: what looks like a gap is actually a non-existent column. You can think of the columns as a variable-length collection, e.g. a dictionary in Python, where each item is optional. In such a case, each row can be represented as a Python dictionary that holds only the columns present for that key.
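A minimal sketch of that dictionary view follows. Only the two row keys and the `founded` column name come from the text; the "The Beatles" row, the `born` column, and the helper functions are hypothetical stand-ins for the contents of the missing figure.

```python
# Each row key maps to a dictionary holding only the columns that exist.
rows = {
    "The Beatles": {"founded": 1960},  # hypothetical row that has the column
    "John Lennon": {"born": 1940},     # no "founded" entry is stored at all
    "Paul McCartney": {"born": 1942},
}


def get_cell(row_key, column, default=None):
    """Read one cell; a missing column is simply absent, not a stored NULL."""
    return rows.get(row_key, {}).get(column, default)


def stored_cells():
    """Number of values physically present across all rows."""
    return sum(len(columns) for columns in rows.values())


lennon_founded = get_cell("John Lennon", "founded")
beatles_founded = get_cell("The Beatles", "founded")
cells = stored_cells()
```

A dense table with the same three rows and two columns would hold six cells, three of them NULL; here only the three existing values take up space.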

As expected, just like for the other members of the NoSQL family, the choice of partition key plays a critical role: in addition to speeding up index-based queries, it dictates what is stored contiguously and what can be broken into smaller chunks.

Advantages

Scalability
Fast write/read

Disadvantages

Update/modification operations are slow

Vendors

Cassandra
HBase
Google BigTable
Druid

Use cases:

Telemetry
IoT
Reporting

Graph Databases

The last NoSQL database type shifts the focus onto the relationships between entities. Entities, such as users, are represented by nodes, whereas the connections between them describe how they are related.

Graph databases store the relationship information within each node. By doing so, they remove the lookup operations that a relational database needs to resolve relationships, saving much-needed resources.

In a graph database, because the relations are pre-computed, queries are faster as they do not require a lookup.
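Storing relationships on the node itself can be sketched with an adjacency mapping in Python. The user names and the friend-of-friend query are illustrative assumptions; a real graph database adds typed, directed edges, properties, and a query language on top.

```python
# Each node stores its own relationships (an adjacency set).
graph = {
    "alice": {"bob", "carol"},
    "bob": {"alice", "dave"},
    "carol": {"alice"},
    "dave": {"bob"},
    "erin": set(),
}


def friends_of_friends(graph, start):
    """Follow stored relationships two hops out, excluding direct friends."""
    direct = graph[start]
    two_hops = set()
    for friend in direct:
        # No join table lookup: the neighbours are read straight off the node.
        two_hops |= graph[friend]
    return two_hops - direct - {start}


suggestions = friends_of_friends(graph, "alice")
```

The cost of the query depends on how many relationships the visited nodes have, not on the total number of users, which is why traversals such as recommendations suit this model.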

Vendors

Neo4j
OrientDB
ArangoDB

Use cases:

Knowledge Graphs
Identity Graphs
Fraud Detection
Recommendation Engines
Social Networks

Conclusion

As with most technologies, it is the developer's responsibility to select the tool suited to the problem. The first pillar is always the same: make sure you understand the business problem at hand before delving into the technology.

Hopefully, this post helped you toward that goal!

Next up: NewSQL and Distributed SQL. Thanks for reading!

Translated from: https://medium.com/towards-artificial-intelligence/exploring-the-nosql-family-49e9f23313ad
