Python自然语言处理学习笔记(19):3.3 使用Unicode进行文字处理

 

3.3 Text Processing with Unicode 使用Unicode进行文字处理

 

Our programs will often need to deal with different languages, and different character sets. The concept of “plain text” is a fiction(虚构). If you live in the English-speaking world you probably use ASCII, possibly without realizing it. If you live in Europe you might use one of the extended Latin character sets, containing such characters as “ø” for Danish(丹麦语) and Norwegian(挪威语), “ő” for Hungarian(匈牙利语), “ñ” for Spanish and Breton(法国的布列塔尼语), and “ň” for Czech(捷克语) and Slovak(斯洛伐克语). In this section, we will give an overview of how to use Unicode for processing texts that use non-ASCII character sets.

 

What Is Unicode?   神马是Unicode?

 

Unicode supports over a million characters. Each character is assigned a number, called a code point(编码点). In Python, code points are written in the form \uXXXX, where XXXX is the number in four-digit hexadecimal form(十六进制).

 

Within a program, we can manipulate Unicode strings just like normal strings. However, when Unicode characters are stored in files or displayed on a terminal, they must be encoded as a stream of bytes. Some encodings (such as ASCII and Latin-2) use a single byte per code point, so they can support only a small subset of Unicode, enough for a single language. Other encodings (such as UTF-8) use multiple bytes and can represent the full range of Unicode characters.

 

Text in files will be in a particular encoding, so we need some mechanism for translating it into Unicode—translation into Unicode is called decoding(解码). Conversely, to write out Unicode to a file or a terminal, we first need to translate it into a suitable encoding— this translation out of Unicode is called encoding(编码), and is illustrated in Figure 3-3.

 

        Figure 3-3. Unicode decoding and encoding

 

From a Unicode perspective, characters are abstract entities that can be realized as one or more glyphs(图像字符). Only glyphs can appear on a screen or be printed on paper. A font(字体) is a mapping from characters to glyphs.

 

Extracting Encoded Text from Files 从文件提取编码文本

 

Let’s assume that we have a small text file, and that we know how it is encoded. For example, polish-lat2.txt, as the name suggests, is a snippet(片段) of Polish text (from the Polish Wikipedia; see http://pl.wikipedia.org/wiki/Biblioteka_Pruska ). This file is encoded as Latin-2, also known as ISO-8859-2. The function nltk.data.find() locates the file for us.

 

  >>> path = nltk.data.find('corpora/unicode_samples/polish-lat2.txt')

 

The Python codecs(编码解码器) module provides functions to read encoded data into Unicode strings, and to write out Unicode strings in encoded form. The codecs.open() function takes an encoding parameter to specify the encoding of the file being read or written. So let’s import the codecs module, and call it with the encoding 'latin2' to open our Polish file as Unicode:

 

  > >>> import codecs

>>> f = codecs.open(path, encoding='latin2')

 

 

For a list of encoding parameters allowed by codecs, see http://docs.python.org/lib/standard-encodings.html . Note that we can write Unicode-encoded data to a file using f = codecs.open(path, 'w', encoding='utf-8'). Text read from the file object f will be returned in Unicode. As we pointed out earlier, in order to view this text on a terminal, we need to encode it, using a suitable encoding.

 

The Python-specific encoding unicode_escape is a dummy(傻瓜) encoding that converts all non-ASCII characters into their \uXXXX representations. Code points above the ASCII 0–127 range but below 256 are represented in the two-digit form \xXX.

 

  >>> for line in f:
  ...     line 
= line.strip()
  ...     
print line.encode('unicode_escape')
  Pruska Biblioteka Pa\u0144stwowa. Jej dawne zbiory znane pod nazw\u0105
  
"Berlinka" to skarb kultury i sztuki niemieckiej. Przewiezione przez
  Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y
  odnalezione po 
1945 r. na terytorium Polski. Trafi\u0142y do Biblioteki
  Jagiello\u0144skiej w Krakowie, obejmuj\u0105 ponad 
500 tys. zabytkowych
  archiwali\xf3w, m.
in. manuskrypty Goethego, Mozarta, Beethovena, Bacha.

 

 

The first line in this output illustrates a Unicode escape string preceded by the \u escape string, namely \u0144. The relevant Unicode character will be displayed on the screen as the glyph ń. In the third line of the preceding example, we see \xf3, which corresponds to the glyph ó, and is within the 128–255 range.

 

In Python, a Unicode string literal can be specified by preceding an ordinary string literal with a u, as in u'hello'. Arbitrary Unicode characters are defined using the \uXXXX escape sequence(转义序列) inside a Unicode string literal. We find the integer ordinal(序数) of a character using ord(). For example:

 

  >>> ord('a')
  
97

 

 

The hexadecimal four-digit notation for 97 is 0061, so we can define a Unicode string literal with the appropriate escape sequence:


  >>> a = u'\u0061'
  
>>> a
  u
'a'
  
>>> print a

  a 

 

Notice that the Python print statement is assuming a default encoding of the Unicode character, namely ASCII. However, ń is outside the ASCII range, so cannot be printed unless we specify an encoding. In the following example, we have specified that print should use the repr() of the string, which outputs the UTF-8 escape sequences (of the form \xXX) rather than trying to render(显示) the glyphs.

 

  >>> nacute = u'\u0144'
  
>>> nacute
  u
'\u0144'
  
>>> nacute_utf = nacute.encode('utf8')
  
>>> print repr(nacute_utf)
  
'\xc5\x84'

 

 

If your operating system and locale are set up to render UTF-8 encoded characters, you ought to be able to give the Python command print nacute_utf and see ń on your screen.

Window的命令行不支持UTF-8的,使用print nacute_utf.decode('utf-8')可以显示

 

There are many factors determining what glyphs are rendered on your screen. If you are sure that you have the correct encoding, but your Python code is still failing to produce the glyphs you expected, you should also check that you have the necessary fonts installed on your system.

 

The module unicodedata lets us inspect the properties of Unicode characters. In the following example, we select all characters in the third line of our Polish text outside the ASCII range and print their UTF-8 escaped value, followed by their code point integer using the standard Unicode convention (i.e., prefixing the hex digits with U+), followed by their Unicode name.

 


  
>>> import unicodedata
  
>>> lines = codecs.open(path, encoding='latin2').readlines()
  
>>> line = lines[2]
  
>>> print line.encode('unicode_escape')
  Niemc\xf3w pod koniec II wojny \u015bwiatowej na Dolny \u015al\u0105sk, zosta\u0142y\n
  
>>> for c in line:
  ...     
if ord(c) > 127:
  ...         
print '%r U+%04x %s' % (c.encode('utf8'), ord(c), unicodedata.name(c))
  
'\xc3\xb3' U+00f3 LATIN SMALL LETTER O WITH ACUTE
  
'\xc5\x9b' U+015b LATIN SMALL LETTER S WITH ACUTE
  
'\xc5\x9a' U+015a LATIN CAPITAL LETTER S WITH ACUTE
  
'\xc4\x85' U+0105 LATIN SMALL LETTER A WITH OGONEK
  
'\xc5\x82' U+0142 LATIN SMALL LETTER L WITH STROKE

If you replace the %r (which yields the repr() value) by %s in the format string of the preceding code sample, and if your system supports UTF-8, you should see an output like the following:

 

  ó U+00f3 LATIN SMALL LETTER O WITH ACUTE

  ś U+015b LATIN SMALL LETTER S WITH ACUTE

  Ś U+015a LATIN CAPITAL LETTER S WITH ACUTE

  ą U+0105 LATIN SMALL LETTER A WITH OGONEK

  ł U+0142 LATIN SMALL LETTER L WITH STROKE

 

Alternatively, you may need to replace the encoding 'utf8' in the example by 'latin2', again depending on the details of your system.

The next examples illustrate how Python string methods and the re module accept Unicode strings.


 
  
>>> line.find(u'zosta\u0142y')
  
54
  
>>> line = line.lower()
  
>>> print line.encode('unicode_escape')
  niemc\xf3w pod koniec ii wojny \u015bwiatowej na dolny \u015bl\u0105sk, zosta\u0142y\n
  
>>> import re
  
>>> m = re.search(u'\u015b\w*', line)
  
>>> m.group() 

  u '\u015bwiatowej'

  

NLTK tokenizers allow Unicode strings as input, and correspondingly yield Unicode strings as output.

 

  >>>> nltk.word_tokenize(line) 
  [u
'niemc\xf3w', u'pod', u'koniec', u'ii', u'wojny', u'\u015bwiatowej',
  u
'na', u'dolny', u'\u015bl\u0105sk', u'zosta\u0142y']

 

 

Using Your Local Encoding in Python Python使用你的本地编码

 

If you are used to working with characters in a particular local encoding, you probably want to be able to use your standard methods for inputting and editing strings in a Python file. In order to do this, you need to include the string '# -*- coding: <coding> -*-' as the first or second line of your file. Note that <coding> has to be a string like 'latin-1', 'big5', or 'utf-8' (see Figure 3-4). Figure 3-4 also illustrates how regular expressions can use encoded strings.

 

转载于:https://www.cnblogs.com/yuxc/archive/2011/08/06/2129447.html

本文来自互联网用户投稿,该文观点仅代表作者本人,不代表本站立场。本站仅提供信息存储空间服务,不拥有所有权,不承担相关法律责任。如若转载,请注明出处:http://www.mzph.cn/news/275320.shtml

如若内容造成侵权/违法违规/事实不符,请联系多彩编程网进行投诉反馈email:809451989@qq.com,一经查实,立即删除!

相关文章

小程序卡片叠层切换卡片_现在,卡片和清单在哪里?

小程序卡片叠层切换卡片重点 (Top highlight)介绍 (Intro) I was recently tasked to redesign the results of the following filters:我最近受命重新设计以下过滤器的结果&#xff1a; Filtered results for users (creatives) 用户的筛选结果(创意) 2. Filtered results fo…

效率神器!UI 稿智能转换成前端代码

做前端&#xff0c;不搬砖大家好&#xff0c;我是若川。从事前端五年之久&#xff0c;也算见证了前端数次变革&#xff0c;从到DW&#xff08;Dreamweaver&#xff09;到H5C3、从JQuery到MVC框架&#xff0c;无数前端大佬在为打造前端完整生态做出努力&#xff0c;由于他们的努…

$.when.apply_When2Meet vs.LettuceMeet:UI和美学方面的案例研究

$.when.apply并非所有计划应用程序都是一样创建的。 (Not all scheduling apps are created equal.) As any college student will tell you, we use When2Meet almost religiously. Between classes, extracurriculars, work, and simply living, When2Meet is the scheduling…

前端不容你亵渎

大家好&#xff0c;我是若川&#xff0c;点此加我微信进源码群&#xff0c;一起学习源码。同时可以进群免费看Vue专场直播&#xff0c;有尤雨溪分享「Vue3 生态现状以及展望」背景最近我在公众号的后台收到一条留言&#xff1a;言语里充满了对前端的不屑和鄙夷&#xff0c;但仔…

利益相关者软件工程_如何向利益相关者解释用户体验的重要性

利益相关者软件工程With the ever increasing popularity of user experience (UX) design there is a growing need for good designers. However, there’s a problem for designers here as well. How can you show the importance of UX to your stakeholders and convince…

云栖大会上,阿里巴巴重磅发布前端知识图谱!

大家好&#xff0c;我是若川&#xff0c;点此加我微信进源码群&#xff0c;一起学习源码。同时可以进群免费看Vue专场直播&#xff0c;有尤雨溪分享「Vue3 生态现状以及展望」阿里巴巴前端知识图谱&#xff0c;由大阿里众多前端技术专家团历经1年时间精心整理&#xff0c;从 初…

Linux下“/”和“~”的区别

在linux中&#xff0c;”/“代表根目录&#xff0c;”~“是代表目录。Linux存储是以挂载的方式&#xff0c;相当于是树状的&#xff0c;源头就是”/“&#xff0c;也就是根目录。 而每个用户都有”家“目录&#xff0c;也就是用户的个人目录&#xff0c;比如root用户的”家“目…

在当今移动互联网时代_谁在提供当今最好的电子邮件体验?

在当今移动互联网时代Hey, a new email service from the makers of Basecamp was recently launched. The Verge calls it a “genuinely original take on messaging”, and it indeed features some refreshing ideas for the sometimes painful exercise we call inbox man…

React 全新文档上线!

大家好&#xff0c;我是若川&#xff0c;点此加我微信进源码群&#xff0c;一起学习源码。同时可以进群免费看明天的Vue专场直播&#xff0c;有尤雨溪分享「Vue3 生态现状以及展望」&#xff0c;还可以领取50场录播视频和PPT。React 官方文档改版耗时 1 年&#xff0c;今天已完…

网络低俗词_从“低俗小说”中汲取7堂课,以创建有影响力的作品集

网络低俗词重点 (Top highlight)Design portfolios and blockbuster movies had become more and more generic. On the design side, I blame all the portfolio reviews and articles shared by “experienced” designers that repeat the same pieces of advice regardless…

尤雨溪写的100多行的“玩具 vite”,十分有助于理解 vite 原理

1. 前言大家好&#xff0c;我是若川。最近组织了源码共读活动&#xff0c;感兴趣的可以加我微信 ruochuan12想学源码&#xff0c;极力推荐之前我写的《学习源码整体架构系列》jQuery、underscore、lodash、vuex、sentry、axios、redux、koa、vue-devtools、vuex4、koa-compose、…

webflow如何使用_我如何使用Webflow构建辅助项目以帮助设计人员进行连接

webflow如何使用I launched Designer Slack Communities a while ago, aiming to help designers to connect with likeminded people. By sharing my website with the world, I’ve connected with so many designers. The whole experience is a first time for me, so I wa…

重新构想原子化 CSS

感谢印记中文的 QC-L[1] 对本文进行翻译&#xff0c;英文原文: English Version[2]。本文会比往期文章相对长些。这是我个人的一个重大的工具发布&#xff0c;有许多内容我想谈论。如果你能花些时间读完&#xff0c;不胜感激&#xff0c;希望能对你有所帮助 :)推荐访问 https:/…

cv::mat 颜色空间_网站设计基础:负空间

cv::mat 颜色空间Let’s start off by answering this question: What is negative space? It is the “empty” space between and around the subjects of an image. In the context of web design, your “subjects” are the pictures, videos, text, buttons and other e…

[知乎回答] 前端是否要学习 Node.js?

大家好&#xff0c;我是若川。最近组织了源码共读活动&#xff0c;感兴趣的可以加我微信 ruochuan12很多小伙伴都表示收获颇丰。一起学的大多数200行左右的Node.js源码。今天推荐这篇文章。&#xff08;刚刚在写明天掘金要发的文章&#xff0c;差点忘记今天还没发文。在知乎上看…

shields 徽标_我的徽标素描过程

shields 徽标Sketching is arguably the most important part of my process when it comes to logo design. In the beginning of my design career, I would actually skip this step completely and go right to the computer. I’d find myself getting stuck and then goi…

叮咚,系统检测到 npm 有更新,原理揭秘!

大家好&#xff0c;我是若川。最近组织了源码共读活动&#xff0c;感兴趣的可以加我微信 ruochuan12本文来自V同学投稿的源码共读第六期笔记&#xff0c;写得很有趣。现在已经进行到第十期了。你或许经常看见 npm 更新的提示。npm 更新提示面试官可能也会问你&#xff0c;组件库…

ui设计未来十年前景_UI设计的10条诫命

ui设计未来十年前景重点 (Top highlight)The year is approximately 1,300 BC when Moses received the 10 UI design commandments from the almighty design gods. The list was comprised of best practices that only the most enlightened designers would be aware of.当…

w3ctech 2011 北京站(组图)

门前的牌子大厅一推低价技术书籍会场嘉宾席人渐渐到齐准备工作w3c中国区负责人 安琪 第一个演讲焦峰同学分享了浏览器兼容性的相关问题石川同学分享的是JQuery的相关内容摄影哥微博大屏幕&#xff0c;有亮点哦。。。MBP啊有木有&#xff5e;&#xff5e;&#xff5e;貘大现场提…

浏览器中的 ESM

大家好&#xff0c;我是若川。最近组织了源码共读活动&#xff0c;感兴趣的可以加我微信 ruochuan12早期的web应用非常简单&#xff0c;可以直接加载js的形式去实现。随着需求的越来越多&#xff0c;应用越做越大&#xff0c;需要模块化去管理项目中的js、css、图片等资源。这里…