NLTK Reading Notes: Categorizing and Tagging Words

Date: 2009-10-20 15:58:44, SuperAngevil's Blog

Original: http://superangevil.wordpress.com/2009/10/20/nltk5/

Topic: NLTK

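0. Questions addressed in this chapter

(1) What are lexical categories, and how are they used in NLP?
(2) Which Python data structures are well suited to storing words and their categories?
(3) How can we tag words automatically?

The chapter also covers some basic NLP techniques: sequence labeling, n-gram models, backoff, and evaluation.

In a typical NLP pipeline, the first step is to split the text stream into semantic units (tokenization, e.g. word segmentation), and the second step is part-of-speech tagging (POS tagging).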

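1. Using a Tagger

>>> import nltk

>>> text = nltk.word_tokenize("And now for something completely different")

>>> nltk.pos_tag(text)

A POS tagger processes a sequence of words and assigns a part-of-speech tag to each one, returning the result as a list of (word, tag) tuples.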

2. Tagged Corpora

(1) Representing tagged tokens

>>> tagged_token = nltk.tag.str2tuple('fly/NN')

>>> tagged_token[0] == 'fly'

>>> tagged_token[1] == 'NN'

>>> sent = 'The/AT grand/JJ jury/NN commented/VBD'  # example word/TAG string

>>> [nltk.tag.str2tuple(t) for t in sent.split()]
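
To make the word/TAG convention concrete, here is a small standard-library sketch (Python 3) of what such a conversion might look like. It is an approximation written for these notes, not NLTK's own source: the assumption is that the string is split at the last separator and the tag is upper-cased.

```python
def str2tuple(s, sep="/"):
    """Split 'word/TAG' at the last separator; return (s, None) if there is none."""
    word, found, tag = s.rpartition(sep)
    if not found:
        return (s, None)
    return (word, tag.upper())

sent = "The/AT grand/JJ jury/NN commented/VBD"
tagged = [str2tuple(t) for t in sent.split()]
print(tagged)
```

Splitting at the last separator means a word that itself contains a slash keeps it in the word part.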

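(2) Reading tagged corpora

NLTK's corpus readers provide a uniform interface for reading tagged corpora, tagged_words():

>>> nltk.corpus.brown.tagged_words()

>>> nltk.corpus.brown.tagged_words(simplify_tags=True)  # NLTK 2 API; NLTK 3 replaced it with tagset='universal'

NLTK also includes Chinese corpora, encoded in Unicode. If a corpus is also segmented into sentences, it has a tagged_sents() method as well.

(3) A simplified POS tagset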

Tag	Meaning	Examples
`ADJ`	adjective	new, good, high, special, big, local
`ADV`	adverb	really, already, still, early, now
`CNJ`	conjunction	and, or, but, if, while, although
`DET`	determiner	the, a, some, most, every, no
`EX`	existential	there, there’s
`FW`	foreign word	dolce, ersatz, esprit, quo, maitre
`MOD`	modal verb	will, can, would, may, must, should
`N`	noun	year, home, costs, time, education
`NP`	proper noun	Alison, Africa, April, Washington
`NUM`	number	twenty-four, fourth, 1991, 14:24
`PRO`	pronoun	he, their, her, its, my, I, us
`P`	preposition	on, of, at, with, by, into, under
`TO`	the word to	to
`UH`	interjection	ah, bang, ha, whee, hmpf, oops
`V`	verb	is, has, get, do, make, see, run
`VD`	past tense	said, took, told, made, asked
`VG`	present participle	making, going, playing, working
`VN`	past participle	given, taken, begun, sung
`WH`	wh determiner	who, which, when, what, where, how

>>> from nltk.corpus import brown

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> tag_fd = nltk.FreqDist(tag for (word, tag) in brown_news_tagged)

>>> tag_fd.keys()

Nouns: generally refer to people, places, things, or concepts.

Verbs: describe events and actions.

Adjectives and Adverbs: adjectives describe nouns, and adverbs modify verbs, among other things.

>>> wsj = nltk.corpus.treebank.tagged_words(simplify_tags=True)

>>> word_tag_fd = nltk.FreqDist(wsj)

>>> [word + "/" + tag for (word, tag) in word_tag_fd if tag.startswith('V')]

(5) Using tagged corpora

>>> brown_learned_text = brown.words(categories='learned')

>>> sorted(set(b for (a, b) in nltk.ibigrams(brown_learned_text) if a == 'often'))

>>> brown_lrnd_tagged = brown.tagged_words(categories='learned', simplify_tags=True)

>>> tags = [b[1] for (a, b) in nltk.ibigrams(brown_lrnd_tagged) if a[0] == 'often']

>>> fd = nltk.FreqDist(tags)

>>> fd.tabulate()

>>> brown_news_tagged = brown.tagged_words(categories='news', simplify_tags=True)

>>> data = nltk.ConditionalFreqDist((word.lower(), tag)
...                                 for (word, tag) in brown_news_tagged)

>>> for word in data.conditions():
...     if len(data[word]) > 3:
...         tags = data[word].keys()
...         print word, ' '.join(tags)

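3. Mapping Words to Properties with Python Dictionaries

In POS tagging every word corresponds to a tag, so it is natural to build a mapping from words to their properties.

Python's collections module provides defaultdict, and NLTK also provides one as nltk.defaultdict. Looking up a key that is not in the dictionary then returns a default value instead of raising an exception.

Both keys and values can be complex objects.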
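
A minimal sketch of these points, using collections.defaultdict from the standard library (Python 3). The toy (word, tag) pairs are invented for illustration and do not come from any corpus.

```python
from collections import defaultdict

# Toy (word, tag) pairs, invented for illustration.
tagged = [("the", "DET"), ("run", "V"), ("run", "N"), ("the", "DET")]

# Missing keys get a default value (0 here) instead of raising KeyError.
counts = defaultdict(int)
for word, _tag in tagged:
    counts[word] += 1

# Values can be containers: map each word to the set of tags seen with it.
tags_for = defaultdict(set)
for word, tag in tagged:
    tags_for[word].add(tag)

# Keys can be compound as long as they are hashable, e.g. tuples.
pair_counts = defaultdict(int)
for word, tag in tagged:
    pair_counts[(word, tag)] += 1

print(counts["the"], counts["unseen"])   # 2 0
print(sorted(tags_for["run"]))           # ['N', 'V']
```

Note that merely reading counts["unseen"] inserts that key with the default value, which matters if the dictionary's size is inspected later.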

4. Automatic Tagging

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')

>>> brown_sents = brown.sents(categories='news')

(1) The Default Tagger

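The simplest tagging method assigns the same tag to every token. Giving every token the most likely tag is the "laziest" shortcut:

>>> tags = [tag for (word, tag) in brown.tagged_words(categories='news')]

>>> nltk.FreqDist(tags).max()

Running these two lines shows that the most common tag is 'NN'.

Next we create a tagger that tags every token as NN:

>>> raw = 'I do not like green eggs and ham, I do not like them Sam I am!'

>>> tokens = nltk.word_tokenize(raw)

>>> default_tagger = nltk.DefaultTagger('NN')

>>> default_tagger.tag(tokens)

Of course, this tagger performs poorly in practice:

>>> default_tagger.evaluate(brown_tagged_sents)

0.13089484257215028

The purpose of the default tagger is to make a language processing system more robust: because it always returns a tag, it can serve as the last resort for words that no other tagger can handle.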

(2) The Regular Expression Tagger

A tagger that assigns tags by matching regular expressions. The patterns are tried in order and the first match wins:

>>> patterns = [
...     (r'.*ing$', 'VBG'),               # gerunds
...     (r'.*ed$', 'VBD'),                # simple past
...     (r'.*es$', 'VBZ'),                # 3rd singular present
...     (r'.*ould$', 'MD'),               # modals
...     (r'.*\'s$', 'NN$'),               # possessive nouns
...     (r'.*s$', 'NNS'),                 # plural nouns
...     (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'), # cardinal numbers
...     (r'.*', 'NN')                     # nouns (default)
... ]

>>> regexp_tagger = nltk.RegexpTagger(patterns)

>>> regexp_tagger.tag(brown_sents[3])

Compared with the default tagger, the regular expression tagger does somewhat better:

>>> regexp_tagger.evaluate(brown_tagged_sents)

0.20326391789486245
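
The first-match-wins behaviour can be sketched with the standard re module alone (Python 3). This is an illustration of the idea written for these notes, not NLTK's implementation; the function name regexp_tag is made up.

```python
import re

# (pattern, tag) pairs, tried in order; the first pattern that matches wins.
PATTERNS = [
    (r".*ing$", "VBG"),                 # gerunds
    (r".*ed$", "VBD"),                  # simple past
    (r".*es$", "VBZ"),                  # 3rd singular present
    (r".*ould$", "MD"),                 # modals
    (r".*'s$", "NN$"),                  # possessive nouns
    (r".*s$", "NNS"),                   # plural nouns
    (r"^-?[0-9]+(\.[0-9]+)?$", "CD"),   # cardinal numbers
    (r".*", "NN"),                      # nouns (default)
]
COMPILED = [(re.compile(pattern), tag) for pattern, tag in PATTERNS]

def regexp_tag(tokens):
    """Tag each token with the tag of the first pattern that matches it."""
    result = []
    for token in tokens:
        for regex, tag in COMPILED:
            if regex.match(token):
                result.append((token, tag))
                break
    return result

print(regexp_tag(["running", "would", "3.14", "cats", "table"]))
```

Order matters: if the catch-all r".*" came first, every token would be tagged NN and the later patterns would never be consulted.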

(3) The Lookup Tagger

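We find the 100 most frequent words and store their most likely tags, then use this information as the model for a "lookup tagger" (in NLTK, a UnigramTagger):

>>> fd = nltk.FreqDist(brown.words(categories='news'))

>>> cfd = nltk.ConditionalFreqDist(brown.tagged_words(categories='news'))

>>> most_freq_words = fd.keys()[:100]  # NLTK 2: keys() are sorted by decreasing frequency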

>>> likely_tags = dict((word, cfd[word].max()) for word in most_freq_words)

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags)

>>> baseline_tagger.evaluate(brown_tagged_sents)

0.45578495136941344

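This tagger does better again than the previous two.

We first consult the lookup table, and if it cannot determine a token's tag we fall back to the default tagger. This process is called backoff.

How is it implemented? Pass the default tagger to the lookup tagger as its backoff argument:

>>> baseline_tagger = nltk.UnigramTagger(model=likely_tags,
...                                      backoff=nltk.DefaultTagger('NN'))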
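
Stripped of the NLTK classes, a lookup tagger with backoff is little more than a dictionary lookup with a fallback value. The sketch below (Python 3, standard library only) assumes a small hand-written table; lookup_tag and the table contents are hypothetical names and data made up for illustration.

```python
def lookup_tag(tokens, table, default=None):
    """Tag tokens from a word -> tag table; use `default` for words not in it."""
    return [(tok, table.get(tok, default)) for tok in tokens]

# Hypothetical lookup table, invented for illustration.
likely_tags = {"the": "AT", "of": "IN", "said": "VBD"}

tokens = ["the", "jury", "said"]
without_backoff = lookup_tag(tokens, likely_tags)        # 'jury' gets None
with_backoff = lookup_tag(tokens, likely_tags, "NN")     # 'jury' falls back to NN

print(without_backoff)
print(with_backoff)
```

Without a backoff, words outside the table are left untagged (None), which is why the lookup tagger is normally paired with a default tagger.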

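(4) Evaluation

Evaluating tools has always been a central theme in NLP. Is the "best" evaluation one carried out by a language expert? Perhaps, but in practice we can instead evaluate against gold standard test data: a corpus that has been annotated by hand and is accepted as the reference for correct tags.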
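
Evaluating against gold standard data comes down to counting how many tokens receive the same tag as in the reference. A minimal sketch in Python 3 follows; the accuracy function, the toy gold sentences and the always_nn tagger are assumptions made up for this example, not NLTK's evaluate method.

```python
def accuracy(tag_fn, gold_sents):
    """Fraction of tokens whose predicted tag equals the gold standard tag."""
    correct = total = 0
    for sent in gold_sents:
        words = [word for word, _ in sent]
        predicted = tag_fn(words)
        for (_, gold_tag), (_, pred_tag) in zip(sent, predicted):
            correct += (gold_tag == pred_tag)
            total += 1
    return correct / total if total else 0.0

# Toy gold standard data, invented for illustration.
gold = [[("the", "AT"), ("dog", "NN"), ("barked", "VBD")],
        [("a", "AT"), ("cat", "NN")]]

def always_nn(words):
    """A default tagger: every word is tagged NN."""
    return [(w, "NN") for w in words]

print(accuracy(always_nn, gold))  # 0.4 (2 of 5 tokens are NN)
```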

5. N-Gram Tagging

(1) Unigram Tagging

A unigram tagger is based on a simple statistical algorithm:

for each token, assign the tag that is most likely for that particular token

For example, the word "frequent" is always tagged JJ, because it occurs as JJ more often than as anything else.

A unigram tagger behaves much like a lookup tagger. The difference lies in how it is built: a unigram tagger is trained, by passing tagged sentence data to UnigramTagger when it is initialized:

>>> from nltk.corpus import brown

>>> brown_tagged_sents = brown.tagged_sents(categories='news')

>>> brown_sents = brown.sents(categories='news')

>>> unigram_tagger = nltk.UnigramTagger(brown_tagged_sents)

>>> unigram_tagger.tag(brown_sents[2007])

>>> unigram_tagger.evaluate(brown_tagged_sents)

0.9349006503968017
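
What "training" means here can be shown with a few lines of standard-library Python 3: count the tags seen with each word and keep the most frequent one. This is a sketch of the statistical idea only; train_unigram and the three toy sentences are invented for illustration.

```python
from collections import Counter, defaultdict

def train_unigram(tagged_sents):
    """Return a word -> most frequent tag table learned from tagged sentences."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        for word, tag in sent:
            counts[word][tag] += 1
    return {word: c.most_common(1)[0][0] for word, c in counts.items()}

# Toy training data, invented for illustration.
train = [[("frequent", "JJ"), ("visits", "NNS")],
         [("they", "PPSS"), ("frequent", "VB"), ("bars", "NNS")],
         [("a", "AT"), ("frequent", "JJ"), ("guest", "NN")]]

model = train_unigram(train)
print(model["frequent"])    # JJ (seen twice as JJ, once as VB)
print(model.get("unseen"))  # None: the word never occurred in training
```

A table built this way memorizes its training data, so scoring it on the same sentences it was trained on gives an optimistic figure.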

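(2) Separating training and test data

Scoring a tagger on the data it was trained on, as above, overstates its accuracy. The data set is therefore usually split in two: 90% for training and 10% for testing.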
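
A sketch of the 90/10 split in Python 3. The helper name split_train_test and the placeholder sentences are made up for illustration; the resulting train_sents and test_sents play the role of the variables used in the sessions that follow.

```python
def split_train_test(sents, train_fraction=0.9):
    """Split a list of sentences into a training part and a test part."""
    size = int(len(sents) * train_fraction)
    return sents[:size], sents[size:]

# Placeholder tagged sentences, invented for illustration.
sents = [[("w%d" % i, "NN")] for i in range(20)]

train_sents, test_sents = split_train_test(sents)
print(len(train_sents), len(test_sents))  # 18 2
```

A plain slice keeps the corpus order; whether to shuffle first is a separate design decision that this sketch leaves out.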

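(3) N-Gram Tagging

A unigram tagger considers only the current token. We can also look back at the preceding tokens when deciding how to tag the current one, which is what an n-gram tagger does: its context is the current token together with the tags of the n-1 preceding tokens, so the context is larger.

A special case of the n-gram tagger is the bigram tagger: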

>>> bigram_tagger = nltk.BigramTagger(train_sents)

>>> bigram_tagger.tag(brown_sents[2007])

>>> unseen_sent = brown_sents[4203]

>>> bigram_tagger.tag(unseen_sent)

>>> bigram_tagger.evaluate(test_sents)

0.10276088906608193

The bigram tagger clearly does very badly on sentences it has not seen!

As n gets larger, the specificity of the contexts increases, as does the chance that the data we wish to tag contains contexts that were not present in the training data. This is known as the sparse data problem, and is quite pervasive in NLP. It leads to a trade-off between accuracy and coverage (the precision/recall trade-off).
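
The sparse data problem is easy to reproduce on a toy scale. The sketch below (Python 3, standard library only) builds a bigram table keyed on (previous tag, current word) and shows what happens with a single unseen word. The function names and the one-sentence training set are assumptions made for this example.

```python
from collections import Counter, defaultdict

def train_bigram(tagged_sents):
    """Map (previous tag, word) -> most frequent tag seen in that context."""
    counts = defaultdict(Counter)
    for sent in tagged_sents:
        prev = None  # None marks the start of a sentence
        for word, tag in sent:
            counts[(prev, word)][tag] += 1
            prev = tag
    return {ctx: c.most_common(1)[0][0] for ctx, c in counts.items()}

def bigram_tag(words, model):
    """Tag words left to right; a context never seen in training yields None."""
    tagged, prev = [], None
    for word in words:
        tag = model.get((prev, word))
        tagged.append((word, tag))
        prev = tag
    return tagged

# Toy training data, invented for illustration.
train = [[("the", "AT"), ("dog", "NN"), ("barks", "VBZ")]]
model = train_bigram(train)

seen = bigram_tag(["the", "dog", "barks"], model)
unseen = bigram_tag(["the", "cat", "barks"], model)
print(seen)
print(unseen)
```

In the second sentence the unknown word "cat" gets None, and so does "barks" after it: the context (None, "barks") was never observed in training, so one gap spoils the rest of the sentence.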

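(4) Combining Taggers

To balance accuracy and coverage, one option is to use a more accurate algorithm, but that helps only if the algorithm also has high coverage.

The other option is to combine taggers:

1) First try to tag the token with the bigram tagger

2) If the bigram tagger cannot tag the token, try the unigram tagger

3) If the unigram tagger cannot tag it either, use the default tagger

>>> t0 = nltk.DefaultTagger('NN')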

>>> t1 = nltk.UnigramTagger(train_sents, backoff=t0)

>>> t2 = nltk.BigramTagger(train_sents, backoff=t1)

>>> t2.evaluate(test_sents)

0.84491179108940495
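
The three-step fallback can be sketched as a chain of functions, each returning a tag or None (Python 3, standard library only). The tiny bigram and unigram tables are hypothetical, and make_chain is a name invented for this illustration rather than anything in NLTK.

```python
def make_chain(*taggers):
    """Return a function that tries each tagger in order until one answers."""
    def tag_word(word, prev_tag):
        for tagger in taggers:
            tag = tagger(word, prev_tag)
            if tag is not None:
                return tag
        return None
    return tag_word

# Hypothetical toy models, invented for illustration.
bigram_model = {("TO", "increase"): "VB"}
unigram_model = {"to": "TO", "increase": "NN", "grants": "NNS"}

def t2(word, prev_tag):          # bigram: needs the previous tag
    return bigram_model.get((prev_tag, word))

def t1(word, prev_tag):          # unigram: ignores the context
    return unigram_model.get(word)

def t0(word, prev_tag):          # default: always answers
    return "NN"

tag_word = make_chain(t2, t1, t0)

def tag(words):
    tagged, prev = [], None
    for w in words:
        prev = tag_word(w, prev)
        tagged.append((w, prev))
    return tagged

result = tag(["to", "increase", "grants", "quickly"])
print(result)
```

Here "increase" is tagged VB by the bigram table because it follows TO, "to" and "grants" fall through to the unigram table, and the unknown "quickly" ends up with the default NN.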

(5) Tagging Unknown Words

Unknown words can be handled by using a regular expression tagger or a default tagger as the backoff.

A useful method to tag unknown words based on context is to limit the vocabulary of a tagger to the most frequent n words, and to replace every other word with a special word UNK. During training, a unigram tagger will probably learn that UNK is usually a noun. However, the n-gram taggers will detect contexts in which it has some other tag. For example, if the preceding word is to (tagged TO), then UNK will probably be tagged as a verb.
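
The vocabulary-limiting step might look like the following sketch (Python 3, standard library only). The helper limit_vocabulary and the toy sentences are invented for illustration; the rewritten sentences would then be used as training data for the taggers.

```python
from collections import Counter

def limit_vocabulary(tagged_sents, n, unk="UNK"):
    """Keep the n most frequent words; replace every other word with `unk`."""
    freq = Counter(word for sent in tagged_sents for word, _ in sent)
    vocab = {word for word, _ in freq.most_common(n)}
    mapped = [[(w if w in vocab else unk, t) for w, t in sent]
              for sent in tagged_sents]
    return mapped, vocab

# Toy data, invented for illustration.
sents = [[("to", "TO"), ("blog", "VB")],
         [("to", "TO"), ("tweet", "VB")],
         [("the", "AT"), ("blog", "NN")]]

mapped, vocab = limit_vocabulary(sents, 2)
print(mapped)
```

The same replacement has to be applied to new text before tagging it, so that words outside the vocabulary are looked up as UNK as well.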

(6) Storing Taggers

Training is usually a lengthy process, so it is worth saving a trained tagger for later reuse:

>>> from cPickle import dump

>>> output = open('t2.pkl', 'wb')

>>> dump(t2, output, -1)

>>> output.close()

Loading works as follows:

>>> from cPickle import load

>>> input = open('t2.pkl', 'rb')

>>> tagger = load(input)

>>> input.close()
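
In Python 3 the cPickle module no longer exists under that name and the standard pickle module is used instead. A sketch of the same save/load round trip, with a plain dictionary standing in for the trained tagger and a temporary directory chosen only to keep the example self-contained:

```python
import os
import pickle
import tempfile

# Stand-in for a trained tagger: any picklable object is handled the same way.
model = {"the": "AT", "dog": "NN"}

path = os.path.join(tempfile.mkdtemp(), "t2.pkl")

# Save: protocol -1 in the session above means "highest available protocol".
with open(path, "wb") as output:
    pickle.dump(model, output, pickle.HIGHEST_PROTOCOL)

# Load it back.
with open(path, "rb") as infile:
    restored = pickle.load(infile)

print(restored == model)  # True
```

The with statements close the files even if an error occurs. Pickle files should only be loaded from trusted sources, since unpickling can execute arbitrary code.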

(7) Performance Limitations

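1) Consider the ambiguity the tagger encounters: how often does the same context (the two preceding tags plus the current word) occur with more than one tag?

>>> from __future__ import division  # true division in Python 2

>>> cfd = nltk.ConditionalFreqDist(
...            ((x[1], y[1], z[0]), z[1])
...            for sent in brown_tagged_sents
...            for x, y, z in nltk.trigrams(sent))

>>> ambiguous_contexts = [c for c in cfd.conditions() if len(cfd[c]) > 1]

>>> sum(cfd[c].N() for c in ambiguous_contexts) / cfd.N()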

0.049297702068029296

2) Study its errors with a confusion matrix

>>> test_tags = [tag for sent in brown.sents(categories='editorial')
...                  for (word, tag) in t2.tag(sent)]

>>> gold_tags = [tag for (word, tag) in brown.tagged_words(categories='editorial')]

>>> print nltk.ConfusionMatrix(gold_tags, test_tags)

Based on these findings we can revise our tagset.
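
At its core a confusion matrix is a count of (gold tag, predicted tag) pairs. The sketch below (Python 3, standard library only) builds one with a Counter and lists the off-diagonal entries, i.e. the errors. The two short tag lists are invented for illustration.

```python
from collections import Counter

def confusion_counts(gold_tags, test_tags):
    """Count how often each gold tag was predicted as each test tag."""
    return Counter(zip(gold_tags, test_tags))

# Toy tag sequences, invented for illustration.
gold_tags = ["NN", "VB", "NN", "JJ", "NN"]
test_tags = ["NN", "NN", "NN", "NN", "NN"]

matrix = confusion_counts(gold_tags, test_tags)

# Off-diagonal cells are the mistakes, most frequent first.
errors = [(pair, n) for pair, n in matrix.most_common() if pair[0] != pair[1]]
print(errors)
```

Tag pairs that are confused often are the natural candidates when deciding whether two tags should be merged or a distinction dropped.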

6. Transformation-Based Tagging

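The table of an n-gram tagger grows as n grows. Brill tagging is a different approach: an inductive tagging method whose model is only a tiny fraction of the size of an n-gram tagger's.

Brill tagging is a kind of transformation-based learning. The basic idea is to guess the tag of each word, then go back and fix the mistakes. In this way a Brill tagger successively transforms a bad tagging of a text into a better one.

As with n-gram tagging, this is a supervised learning method: we need annotated training data to judge whether the tagger's guesses are correct.

Brill tagging can be compared to painting. Suppose we are painting a tree, with all its details of boughs, branches, twigs, and leaves, against a sky-blue background. Rather than painting the tree first and then filling in blue around it, it is simpler to paint the whole canvas blue and then "correct" the tree section by painting over the blue: begin with broad brush strokes then fix up the details, with successively finer changes.

As an example, consider tagging the following sentence:

The President said he will ask Congress to increase grants to states for vocational rehabilitation

We first tag it with the unigram tagger, then apply the following correction rules: (a) replace NN with VB when the previous word is TO; (b) replace TO with IN when the next tag is NNS. The process looks like this: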

Phrase	to	increase	grants	to	states	for	vocational	rehabilitation
Unigram	TO	NN	NNS	TO	NNS	IN	JJ	NN
Rule 1		VB
Rule 2				IN
Output	TO	VB	NNS	IN	NNS	IN	JJ	NN
Gold	TO	VB	NNS	IN	NNS	IN	JJ	NN
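
The two correction rules in the table can be applied mechanically. The sketch below (Python 3, standard library only) starts from the Unigram row and applies Rule 1 and then Rule 2. It only illustrates applying given rules; how a Brill tagger learns such rules from training data is not shown, and the helper names are invented for this example.

```python
def apply_rule(tags, from_tag, to_tag, condition):
    """Return a new tag list with from_tag replaced where condition(tags, i) holds."""
    return [to_tag if tag == from_tag and condition(tags, i) else tag
            for i, tag in enumerate(tags)]

def prev_is(tag):
    """Condition: the tag immediately before position i equals `tag`."""
    return lambda tags, i: i > 0 and tags[i - 1] == tag

def next_is(tag):
    """Condition: the tag immediately after position i equals `tag`."""
    return lambda tags, i: i + 1 < len(tags) and tags[i + 1] == tag

words = "to increase grants to states for vocational rehabilitation".split()
unigram = ["TO", "NN", "NNS", "TO", "NNS", "IN", "JJ", "NN"]

step1 = apply_rule(unigram, "NN", "VB", prev_is("TO"))   # Rule 1
step2 = apply_rule(step1, "TO", "IN", next_is("NNS"))    # Rule 2

print(list(zip(words, step2)))
```

Rule 1 changes only "increase" (the final NN follows JJ, not TO), and Rule 2 changes only the second "to", which is followed by NNS, so the result matches the Output row.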