nosql
Data Science
Knowledge of NoSQL databases is increasingly required in data science applications, yet the taxonomy is so diverse and problem-centered that it can be a challenge to grasp. This post attempts to shed light on some of the concepts, delving into the specifics of each design.
We start by briefly introducing NoSQL and the reasoning behind its appearance, then analyze each of the four members of the NoSQL family: their behavior and main mechanisms, along with their advantages, disadvantages, and typical use cases.
What is NoSQL?
NoSQL (Not-only SQL) came into prominence in the mid-to-late 2000s as an alternative to traditional SQL databases. Instigated by the Web 2.0 industry, it allows for horizontal scaling, distributed databases, and flexible models (schema-less design). This paradigm shift means developers can spend more of their time building features and less on database design. Typically, NoSQL solutions are presented as a cost-effective alternative to their SQL counterparts, one that relaxes the rigidity of RDBMSs.
Initially, NoSQL databases focused on key-value models, effectively removing the need for SQL, hence the name NoSQL as an abbreviation of “No SQL Support”. Over time the community realized that each tool filled a specific need, abandoning the “death to SQL” sentiment in favor of a coexistence-driven approach, and the meaning of NoSQL changed to “Not-only SQL”.
The figure on the left represents the CAP theorem, which states that it is impossible for a distributed data store to simultaneously provide more than two of the three guarantees: Consistency, Availability, and Partition tolerance.
Whereas traditional RDBMSs focus on the left side of the diagram (Consistency and Availability, CA), NoSQL databases allow for horizontal partitioning and speed by sacrificing Consistency in favor of Availability and Partition Tolerance (AP).
Although some do have ACID compliance, most NoSQL databases revolve around what is known as “Eventual Consistency”, abiding by the BASE properties (Basically Available, Soft state, Eventual consistency) instead. Eventual consistency stipulates that, eventually, all the nodes within the cluster will contain the same data version. Until then, multiple queries may temporarily return different results, known as stale reads, a direct result of relaxing the Consistency guarantee.
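To make stale reads concrete, here is a minimal sketch of eventual consistency. It is a toy in-memory model, not how any particular database replicates data: the class name `EventuallyConsistentStore`, its methods, and the explicit `sync()` step are all assumptions made for illustration.

```python
class EventuallyConsistentStore:
    """Toy model: a write lands on one replica and reaches the others later."""

    def __init__(self, n_replicas=3):
        # Each replica holds its own copy of the data.
        self.replicas = [{} for _ in range(n_replicas)]
        # Updates that have been accepted but not yet propagated.
        self.pending = []

    def write(self, key, value):
        # Acknowledge the write as soon as the first replica has it.
        self.replicas[0][key] = value
        for index in range(1, len(self.replicas)):
            self.pending.append((index, key, value))

    def read(self, key, replica_index):
        # A read may be served by any replica, up to date or not.
        return self.replicas[replica_index].get(key)

    def sync(self):
        # Stand-in for background replication: apply all pending updates.
        for index, key, value in self.pending:
            self.replicas[index][key] = value
        self.pending.clear()


store = EventuallyConsistentStore()
store.write("user:1", "v1")
store.sync()                           # every replica now holds "v1"

store.write("user:1", "v2")            # only replica 0 has "v2" so far
fresh = store.read("user:1", 0)        # read from the replica that took the write
stale = store.read("user:1", 2)        # read from a replica that lags behind

store.sync()                           # replication catches up
converged = store.read("user:1", 2)

print(fresh, stale, converged)
```

Between the second write and the second `sync()`, two reads of the same key disagree. Once replication catches up, every replica returns the same version again.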
So when should you use NoSQL? Simply put: when ACID compliance is not a requirement, when fast and cheap scalability is mandatory, or when your application falls into the Big Data category.
Before proceeding, keep in mind that NoSQL design is drastically different from that of its RDBMS counterparts. In NoSQL, data tends to be denormalized and often repeated, as storage is considered a cheap commodity compared with access speed and availability. This paradigm, combined with the lack of a consistency guarantee, is bound to generate temporary divergences.
Note #1: Several authors consider that NoSQL has clearly failed its goal, paving the way for the accommodation of NoSQL-like features in traditional SQL via NewSQL or Distributed SQL. But let us skip this topic for the time being.
Note #2: Some NoSQL databases allow for ACID-like transactions as long as they are performed within a single partition (referred to as entity group transactions in Azure Table Storage); MongoDB added support for ACID-like distributed transactions in version 4.2.
Note #3: Some SQL databases are able to combine horizontal sharding/scaling and distributed queries using, for instance, the Citus extension for PostgreSQL.
Key-Value Databases
The simplest flavor of NoSQL database revolves around the concept of associative arrays; in other words, it simply ties a given key to a record of any type, from a simple string or JSON to video files.
Key-value databases are organized into partitions or buckets, which can contain one or several entities, and each record is identified by a unique row key. Records in turn contain one or multiple fields. The concept of partition and row keys is one of the most critical aspects of NoSQL, as it allows logical partitions (defined by the partition key) to be moved around physical partitions (nodes) according to their workload. At the same time, it is one of the major hindrances for newcomers: selecting the wrong key will spell disaster should your queries not take advantage of it, and once you’ve selected it, there is no turning back.
The choice of partition key and row key is particularly challenging for key-value databases, given that you can only query by key. What does this mean? In key-value terms, a JSON document can be represented by a horizontal diagram in which the id field is the record’s key, linked to the JSON’s values.
As long as you know the partition and record keys, CRUD operations are blazing fast. But what if you want to retrieve all cases where the “name” field equals “Michal”? Unlike in document databases, fields are not automatically indexed, which means that such a query requires going through every record to check whether it contains the “name” field and whether that field equals “Michal”, analogous to a full table scan in SQL. I hope this clarifies how important row key selection is, and why it is important to know the system’s purpose beforehand. Of course, these systems are not designed to be queried by field; we’re merely stating their limitations as a general-purpose database.
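The difference between the two access patterns can be sketched with a plain Python dictionary standing in for the key-value store. The record keys, field names other than “name”, and the helper functions `get_by_key` and `find_by_field` are hypothetical and only serve the illustration.

```python
# A plain dict stands in for the key-value store: key -> opaque record.
records = {
    "u1": {"name": "Michal", "city": "Porto"},
    "u2": {"name": "Ana", "city": "Lisbon"},
    "u3": {"name": "Michal", "city": "Braga"},
}


def get_by_key(store, key):
    """Point lookup: a single hash lookup, regardless of store size."""
    return store.get(key)


def find_by_field(store, field, value):
    """Field query: with no index, every record has to be inspected."""
    matches = []
    for key, record in store.items():
        # The store does not know the record's structure, so we must check
        # that the field exists before comparing its value.
        if isinstance(record, dict) and record.get(field) == value:
            matches.append(key)
    return matches


print(get_by_key(records, "u2"))
print(find_by_field(records, "name", "Michal"))
```

The point lookup touches one entry. The field query visits all of them, so its cost grows with the number of records in the store.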
A typical key would be anything that can be considered unique: customerID, supplierID, sessionID, etc. Most developers opt for composite keys: for instance, UserID.session would store session data, while UserID.user would hold the user’s cached information. This reduces each value to a single field instead of a dictionary-like structure, speeding up CRUD operations.
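A minimal sketch of the composite-key idea, again using a dictionary as a stand-in store. The helper `composite_key`, the user id `"42"`, and the stored values are hypothetical.

```python
def composite_key(user_id, attribute):
    """Build a key of the form '<user_id>.<attribute>'."""
    return f"{user_id}.{attribute}"


# Composite keys: each value is a single field, read or written on its own.
cache = {}
cache[composite_key("42", "session")] = "token-abc"
cache[composite_key("42", "user")] = "Michal"
session = cache[composite_key("42", "session")]

# Alternative: one key per user holding a dictionary-like value.
# Changing a single field means fetching and storing the whole record.
cache_nested = {"42": {"session": "token-abc", "user": "Michal"}}
record = cache_nested["42"]        # fetch the full record
record["session"] = "token-xyz"    # change one field
cache_nested["42"] = record        # write the full record back

print(session, cache_nested["42"]["session"])
```

With composite keys, updating the session never touches the cached user information. The store only moves the one field being read or written.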
How about partition keys? Earlier, we stated that partition keys allow the system to move partitions around according to their workload. Hence the obvious and most direct consequence of improper partition key selection is the creation of hot partitions: groups of entities that are frequently accessed but cannot be distributed, given that they operate as a single unit (a single partition). With this in mind, a partition could be the UserID, whilst the keys would represent different user attributes: UserID.Name, UserID.Location, UserID.Height. If a physical partition contains several hot logical partitions (e.g. a set of users that request data frequently), the engine is able to redistribute the logical partitions, spreading the workload across different nodes. Without an appropriate partition key, the logical partition may be unary and thus impossible to distribute.
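The effect of the partition key on load distribution can be sketched as follows. The greedy placement strategy below is an assumption chosen for simplicity, not how any specific engine rebalances; the function `assign_partitions` and the load figures are hypothetical.

```python
def assign_partitions(partition_load, n_nodes):
    """Greedily place logical partitions on the least-loaded node.

    partition_load maps a partition key to its request count.
    A logical partition is never split: it moves as a single unit.
    """
    node_load = [0] * n_nodes
    placement = {}
    # Place the heaviest partitions first.
    by_load = sorted(partition_load.items(), key=lambda item: item[1], reverse=True)
    for partition, load in by_load:
        target = node_load.index(min(node_load))
        placement[partition] = target
        node_load[target] += load
    return placement, node_load


# Partition key = UserID: four logical partitions that can be spread out.
per_user = {"user1": 40, "user2": 35, "user3": 15, "user4": 10}
_, balanced = assign_partitions(per_user, n_nodes=2)

# Poor partition key: every record falls into one logical partition.
single = {"all_users": 100}
_, skewed = assign_partitions(single, n_nodes=2)

print(balanced)  # [50, 50]
print(skewed)    # [100, 0]
```

The total workload is the same in both cases. Only the first can be balanced, because the second consists of one indivisible partition that has to live on a single node.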
Another important feature of key-value databases is that their records can have a time to live (TTL), an automatic expiration date that can be controlled. This makes them strong candidates for session-driven storage.
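A minimal sketch of TTL behavior follows. The `TTLStore` class is a hypothetical in-memory stand-in, not the API of any vendor; it expires entries lazily on read, which is one of several possible strategies. The injectable `clock` parameter is an assumption that makes the example deterministic.

```python
import time


class TTLStore:
    """Toy key-value store whose records expire after a time to live."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._data = {}  # key -> (value, expiry timestamp)

    def set(self, key, value, ttl_seconds):
        self._data[key] = (value, self._clock() + ttl_seconds)

    def get(self, key):
        entry = self._data.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if self._clock() >= expires_at:
            # Lazy expiration: the record is dropped when it is next read.
            del self._data[key]
            return None
        return value


# A controllable clock keeps the example independent of real time.
now = [0.0]
sessions = TTLStore(clock=lambda: now[0])
sessions.set("42.session", "token-abc", ttl_seconds=30)

before = sessions.get("42.session")   # still within the 30-second window
now[0] = 31.0                         # advance the clock past the TTL
after = sessions.get("42.session")    # the session has expired

print(before, after)
```

This is why session data fits well: a session that is not refreshed simply disappears, with no cleanup job required from the application.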
Advantages
- Scalability
- Very fast read speed
- Simple and flexible data model
Disadvantages
- No relationship between entities
- No transaction-like behavior
- Only per-key queries are supported
- No support for retrieving multiple keys at once
- Slow multiple updates and collection scans
Vendors
- Redis
- Riak
- Memcached
- Azure Table Storage
Use cases
- Web session cache
- Store user preferences
Document Databases
Document store databases build on the concept behind key-value stores, extending it to support complex multi-layered objects called documents. This seemingly simple difference has several consequences: because the engine is aware of the multi-level or nested structure, all fields, including nested fields, are indexed, allowing queries and selections on specific fields; queries operate within a document but not across documents; denormalized design patterns allow for direct data retrieval without requiring joins; and some document store databases implement ACID-like properties.
Each document can be thought of as a row in a relational model, except that its schema is not predefined, nor do the object types need to be the same across documents.
Like their key-value predecessors, document databases are structured into collections which, in turn, contain partitions and their nested entities. The same caution in selecting partition and document (not row) keys is mandatory. However, as stated above, the values in a document database are automatically indexed, allowing them to be queried. Nonetheless, querying by field is still drastically less efficient than a lookup by key.
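One way to picture automatic field indexing is an index that maps every (field path, value) pair to the documents containing it. The sketch below is a simplified assumption, not the indexing scheme of any specific engine: `DocumentCollection`, `flatten`, and the sample documents are hypothetical, and only scalar field values are handled.

```python
from collections import defaultdict


def flatten(document, prefix=""):
    """Yield (path, value) pairs, e.g. ('address.city', 'Porto')."""
    for field, value in document.items():
        path = f"{prefix}{field}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted path.
            yield from flatten(value, prefix=f"{path}.")
        else:
            yield path, value


class DocumentCollection:
    """Toy collection that indexes every field of every document on insert."""

    def __init__(self):
        self.documents = {}
        self.index = defaultdict(set)  # (path, value) -> document ids

    def insert(self, doc_id, document):
        self.documents[doc_id] = document
        for path, value in flatten(document):
            self.index[(path, value)].add(doc_id)

    def find(self, path, value):
        # Index lookup instead of scanning every document.
        return sorted(self.index.get((path, value), set()))


people = DocumentCollection()
people.insert("1", {"name": "Michal", "address": {"city": "Porto"}})
people.insert("2", {"name": "Ana", "address": {"city": "Porto"}, "age": 31})

print(people.find("name", "Michal"))         # ['1']
print(people.find("address.city", "Porto"))  # ['1', '2']
```

The two documents have different shapes, yet both are queryable by any field, nested or not. The price is paid at write time, since every insert updates the index for each field it contains.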
Note that, due to this document-oriented structure and unlike their key-value counterparts, document store databases have no strong need for composite keys, since per-field access is integrated within the engine.
The following diagram showcases the performance obtained per query type in Azure Cosmos DB, although the pattern extends to document store databases in general. The diagram flows from the most efficient, point queries, to the least efficient, table scans.
At the start of this primer we stipulated that NoSQL databases are located on the AP side of the CAP theorem. In practice, due to continuous developments, some NoSQL databases have the ability to switch between CAP guarantees, allowing the developer to create tailored solutions.
Advantages
- Scalability for complex objects
- Document-oriented data model; JSON or XML allows for complex and schema-less structures
- Supports queries and joins within the document
- Data modeling paradigm allows storing all the data in a single document
- Fast reads and writes
Disadvantages
- Data modeling paradigm leads to data being duplicated among documents
- Complex design leads to inconsistency
Vendors
- MongoDB
- Azure DocumentDB
- AWS DynamoDB
- OrientDB
- CouchDB
Use cases:
- Social networks
- eCommerce
- Anything in which you can relax ACID compliance
Column Family Databases
Column family databases share concepts with both RDBMSs and key-value stores. You can think of the rows as keys in a key-value store, and the columns as the value. Their optimal use case is large data ingestion or data analytics, as they are suitable for storing billions of rows and tens of thousands of columns.
The column family data model can efficiently handle sparse matrices, a major hindrance to traditional RDBMSs. In an RDBMS, every column must be filled for every row (recall that NULL is a value, and the space it occupies corresponds to the column’s type), so empty cells still occupy storage. A column family database, in contrast, stores only the values that actually exist, per column, for each row (or key).
In the previous figure, notice how the keys “John Lennon” and “Paul McCartney” do not have a value set for the founded column; where you see a gap, the column simply does not exist for that row. You can think of the columns as a variable-length collection (e.g. a dictionary in Python) where each item is optional. In such a case, the first row could be represented in Python as:
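A minimal sketch of that representation follows. The figure’s actual contents are not available here, so apart from the two row keys and the founded column named above, every row key, column name, and value below is a hypothetical stand-in.

```python
# Each row key maps to a dictionary holding only the columns that exist
# for that row. Column names and values are illustrative placeholders.
rows = {
    "The Beatles": {"founded": 1960, "genre": "rock"},
    "John Lennon": {"born": 1940},
    "Paul McCartney": {"born": 1942},
}

# A missing column is simply absent; nothing is stored for it.
founded = rows["John Lennon"].get("founded")

# Compare what is stored against a dense table with every column per row.
all_columns = sorted({column for columns in rows.values() for column in columns})
stored_cells = sum(len(columns) for columns in rows.values())
dense_cells = len(rows) * len(all_columns)

print(founded)                    # None
print(stored_cells, dense_cells)  # 4 9
```

Three rows and three distinct columns would take nine cells in a dense relational table. The sparse representation stores only the four values that exist.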
As expected, just as with the other members of the NoSQL family, the choice of partition key plays a critical role: it dictates what is stored contiguously and what can be broken into smaller chunks, in addition to speeding up index-based queries.
Advantages
- Scalability
- Fast write/read
Disadvantages
- Update/modification operations are slow
Vendors
- Cassandra
- HBase
- Google BigTable
- Druid
Use cases:
- Telemetry
- IoT
- Reporting
Graph Databases
The last NoSQL database type shifts the focus onto the relationships between entities. Entities, such as users, are represented by nodes, whereas the connections between entities dictate how they are related.
Graph databases store the relationship information within each node; by doing so, they remove the need for the lookup operations required in relational databases, saving much-needed resources.
In a graph database, on the other hand, the relations are pre-computed, so queries are faster because they do not require a lookup.
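The idea of storing relationships inside each node can be sketched with an adjacency structure. This is an illustrative model only: the node names, the “follows” relation, and the helpers `neighbours` and `within_hops` are hypothetical and do not reflect any vendor’s storage format or query language.

```python
from collections import deque

# Each node carries its own outgoing relationships, so following an edge
# is a direct access rather than a search through a separate table.
graph = {
    "alice": {"follows": ["bob", "carol"]},
    "bob": {"follows": ["dave"]},
    "carol": {"follows": ["dave"]},
    "dave": {"follows": []},
}


def neighbours(graph, node, relation):
    """Return the nodes directly connected to `node` by `relation`."""
    return graph[node].get(relation, [])


def within_hops(graph, start, relation, max_hops):
    """Breadth-first traversal: nodes reachable in at most `max_hops` steps."""
    seen = {start}
    reached = []
    frontier = deque([(start, 0)])
    while frontier:
        node, hops = frontier.popleft()
        if hops == max_hops:
            continue  # do not expand beyond the hop limit
        for nxt in neighbours(graph, node, relation):
            if nxt not in seen:
                seen.add(nxt)
                reached.append(nxt)
                frontier.append((nxt, hops + 1))
    return reached


print(within_hops(graph, "alice", "follows", 1))  # ['bob', 'carol']
print(within_hops(graph, "alice", "follows", 2))  # ['bob', 'carol', 'dave']
```

Each hop only touches the relationships of the nodes already reached. This is the property that makes friends-of-friends queries, as used in recommendation engines and social networks, a natural fit.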
Vendors
- Neo4j
- OrientDB
- ArangoDB
Use cases:
- Knowledge Graphs
- Identity Graphs
- Fraud Detection
- Recommendation Engines
- Social Networks
Conclusion
As with most technologies, it is the developer’s responsibility to select the tool suited to the problem. The first pillar is always the same: make sure you understand the business problem at hand before delving into the technology.
Hopefully, this post helped you toward that goal!
Next up: NewSQL and Distributed SQL. Thanks for reading!
Translated from: https://medium.com/towards-artificial-intelligence/exploring-the-nosql-family-49e9f23313ad