We have to represent every bit of data as numerical values before machine learning and deep learning models can process and analyze it. However, strings do not usually come in a nice and clean format and require a lot of preprocessing.
Pandas provides numerous functions and methods to process textual data. In this post, we will focus on data types for strings rather than string operations. Using appropriate data types is the first step to making the most of Pandas. There are currently two data types for textual data: object and StringDtype.
Before pandas 1.0, only the “object” datatype was used to store strings, which has some drawbacks because non-string data can also be stored using the “object” datatype. Pandas 1.0 introduced a new datatype specific to string data: StringDtype. As of now, we can still use either object or StringDtype to store strings, but in the future we may be required to use only StringDtype.
One important thing to note here is that the object datatype is still the default datatype for strings. To use StringDtype, we need to state it explicitly.
We can pass “string” or pd.StringDtype() to the dtype parameter to select the string datatype.
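As a minimal sketch of selecting the string datatype through the dtype parameter, the snippet below builds the same Series twice, once with the “string” alias and once with pd.StringDtype(). The sample names and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration.
names = ["Ann", "Bob", "Cem"]

# Request the dedicated string dtype via the "string" alias ...
by_alias = pd.Series(names, dtype="string")

# ... or via an explicit StringDtype instance.
by_class = pd.Series(names, dtype=pd.StringDtype())

print(by_alias.dtype)
print(by_class.dtype)
```

Both Series should report a StringDtype rather than object, which can be checked with isinstance(by_alias.dtype, pd.StringDtype).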
We can also convert from the “object” to the “string” data type using the astype function.
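The conversion with astype can be sketched as follows. The city names are invented sample data, and the Series is created with an explicit object dtype so that the starting point does not depend on the pandas defaults of the installed version.

```python
import pandas as pd

# Hypothetical sample data, stored explicitly as object.
cities = pd.Series(["Ankara", "Berlin", None], dtype="object")

# astype returns a new Series; the original keeps its object dtype.
as_string = cities.astype("string")

print(cities.dtype)
print(as_string.dtype)
print(as_string)
```

After the conversion, the missing entry is still reported as missing by isna(), so no text value is invented for it.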
Although the default type is “object”, it is recommended to use “string” for a few reasons.
- The object data type has a broader scope and can store pretty much anything. Thus, even if we have non-strings in a place that is supposed to hold strings, we don’t get any error.
- It is always better to have a dedicated data type. For instance, if we try the example above with the “string” data type, we get a TypeError.
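The two points above can be illustrated with a small sketch. It is not the author's original example: the mixed values are made up. The exact outcome for the “string” case appears to depend on the pandas version (some releases raise an error, others convert the values to text), so the code handles both possibilities instead of assuming one.

```python
import pandas as pd

# Hypothetical mixed input: one string and two numbers.
mixed_values = ["apple", 7, 2.5]

# object accepts everything and keeps the original Python types.
as_object = pd.Series(mixed_values, dtype="object")
object_types = [type(v).__name__ for v in as_object]
print(object_types)

# "string" never stores raw numbers: depending on the pandas version,
# the constructor either raises or converts every value to text.
try:
    as_string = pd.Series(mixed_values, dtype="string")
except (TypeError, ValueError) as exc:
    string_outcome = f"rejected: {type(exc).__name__}"
else:
    only_text = all(isinstance(v, str) for v in as_string)
    string_outcome = f"accepted, all values are str: {only_text}"
print(string_outcome)
```

Either way, a column with the “string” data type ends up holding only text (or missing values), while the object column silently keeps a mix of types.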
Having a dedicated data type allows for data type specific operations. For instance, we cannot use select_dtypes to choose only text columns if the “object” data type is used: select_dtypes(include="object") will return any column with the object data type, whatever it holds. On the other hand, if we use the “string” data type for textual data, select_dtypes(include="string") will give just what we need.
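Here is a short sketch of the select_dtypes behavior described above. The DataFrame, its column names, and its values are hypothetical; dtypes are set explicitly so the result does not rely on version-specific defaults.

```python
import pandas as pd

# Hypothetical frame: one "string" column, two object columns, one integer column.
df = pd.DataFrame(
    {
        "name": pd.Series(["Ann", "Bob"], dtype="string"),
        "note": pd.Series(["ok", "late"], dtype="object"),
        "payload": pd.Series([10, "ten"], dtype="object"),
        "age": [31, 45],
    }
)

# include="object" cannot tell text from mixed content.
object_cols = list(df.select_dtypes(include="object").columns)

# include="string" picks only the columns declared as "string".
string_cols = list(df.select_dtypes(include="string").columns)

print(object_cols)
print(string_cols)
```

The object selection returns both note and payload even though only one of them holds pure text, while the string selection returns just name.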
As of now, the “string” data type is not superior to “object” in terms of performance. However, future enhancements are expected to improve the performance of the “string” data type and reduce its memory consumption. Thus, we should already be using “string” instead of “object” for textual data.
Thank you for reading. Please let me know if you have any feedback.
Translated from: https://towardsdatascience.com/why-we-need-to-use-pandas-new-string-dtype-instead-of-object-for-textual-data-6fd419842e24