Common use cases for natural language processing with spaCy include tokenization, named entity recognition, part-of-speech tagging, and dependency parsing. Below are some typical examples with code:
Tokenization:
Split text into basic units such as words and punctuation marks.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Tokenize the text
text = "This is a sample sentence."
doc = nlp(text)

# Print each token
for token in doc:
    print(token.text)
Output:
This
is
a
sample
sentence
.
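Purely as an illustration of what tokenization does, a naive regex-based splitter reproduces the result above. This is not spaCy's actual algorithm, which uses prefix/suffix/infix rules plus tokenizer exceptions:

```python
import re

text = "This is a sample sentence."
# Match runs of word characters, or any single non-word, non-space character
tokens = re.findall(r"\w+|[^\w\s]", text)
print(tokens)  # ['This', 'is', 'a', 'sample', 'sentence', '.']
```

Note how the trailing period becomes its own token, just as in the spaCy output.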
Named Entity Recognition:
Identify named entities in text, such as person names, place names, and organizations.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Apple is a big company, headquartered in Cupertino, California."

# Process the text
doc = nlp(text)

# Extract the named entities
for ent in doc.ents:
    print(ent.text, ent.label_)
Output:
Apple ORG
Cupertino GPE
California GPE
Part-of-speech Tagging:
Label each word in the text with its part of speech.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "This is a sample sentence."

# Process the text
doc = nlp(text)

# Print each token with its part-of-speech tag
for token in doc:
    print(token.text, token.pos_)
Output:
This PRON
is AUX
a DET
sample NOUN
sentence NOUN
. PUNCT
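The tags in the output come from the Universal POS tag set (spaCy can also describe a tag at runtime via `spacy.explain`). A small lookup table, covering only the tags seen above, makes them readable:

```python
# Descriptions for the Universal POS tags seen above (subset only)
UPOS = {
    "PRON": "pronoun",
    "AUX": "auxiliary",
    "DET": "determiner",
    "NOUN": "noun",
    "PUNCT": "punctuation",
}

# Pair each word from the example with its tag and description
for word, tag in [("This", "PRON"), ("is", "AUX"), ("a", "DET"),
                  ("sample", "NOUN"), ("sentence", "NOUN"), (".", "PUNCT")]:
    print(f"{word}\t{tag}\t{UPOS[tag]}")
```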
Dependency Parsing:
Analyze the dependency relationships between the words in the text.
import spacy

# Load the English model
nlp = spacy.load("en_core_web_sm")

# Input text
text = "Apple is looking at buying U.K. startup for $1 billion"

# Process the text
doc = nlp(text)

# Print each token's dependency relation, its head, and its children
for token in doc:
    print(token.text, token.dep_, token.head.text, token.head.pos_,
          [child for child in token.children])
Output:
Apple nsubj looking VERB []
is aux looking VERB []
looking ROOT looking VERB [Apple, is, at, startup]
at prep looking VERB [buying]
buying pcomp at ADP [U.K.]
U.K. dobj buying VERB []
startup dep looking VERB [for]
for prep startup NOUN [billion]
$ quantmod billion NUM []
1 compound billion NUM []
billion pobj for ADP [$, 1]
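To see how `token.head` links form a tree, the parse above can be modeled as plain parent pointers; walking upward from any word reaches the ROOT. This sketch uses the heads printed above rather than the spaCy API:

```python
# Head of each token, copied from the parse output above;
# the ROOT ("looking") points to itself
HEADS = {
    "Apple": "looking", "is": "looking", "looking": "looking",
    "at": "looking", "buying": "at", "U.K.": "buying",
    "startup": "looking", "for": "startup",
    "$": "billion", "1": "billion", "billion": "for",
}

def path_to_root(word):
    """Follow head pointers until reaching the ROOT token."""
    path = [word]
    while HEADS[word] != word:
        word = HEADS[word]
        path.append(word)
    return path

print(path_to_root("billion"))  # ['billion', 'for', 'startup', 'looking']
```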
English Sentence Segmentation:
import spacy

nlp = spacy.load("en_core_web_sm")
# Add the rule-based sentencizer (optional here: the parser in
# en_core_web_sm also sets sentence boundaries)
nlp.add_pipe("sentencizer")
doc = nlp("This is a sentence. This is another sentence.")
for sentence in doc.sents:
    print(sentence)
Output:
This is a sentence.
This is another sentence.
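The `sentencizer` component is rule-based; roughly speaking, it breaks on sentence-final punctuation. A naive sketch of that idea (spaCy's actual rules also handle abbreviations and other edge cases this regex would get wrong, e.g. "U.K."):

```python
import re

text = "This is a sentence. This is another sentence."
# Split after ., ! or ? followed by whitespace
sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s]
print(sentences)  # ['This is a sentence.', 'This is another sentence.']
```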
Keyword Extraction:
import spacy

nlp = spacy.load("en_core_web_sm")
text = """Please ignore that NLLB is not made to translate this large number of tokens at once. Again, I am more interest in the computational limits I have.I already use torch.no_grad() and put the model in evaluation mode which I read online should safe some memory. My full code to run the inference looks like this:"""
doc = nlp(text)

# Keep nouns and proper nouns as keyword candidates
keywords = [token.text for token in doc if token.pos_ in ['NOUN', 'PROPN']]
print(keywords)
Output:
['NLLB', 'number', 'tokens', 'interest', 'limits', 'torch.no_grad', 'model', 'evaluation', 'mode', 'memory', 'code', 'inference']
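Once keyword candidates are extracted, ranking them by frequency is a common next step. A sketch using `collections.Counter` on the list printed above (every word appears once here, but on longer texts the counts differ):

```python
from collections import Counter

keywords = ['NLLB', 'number', 'tokens', 'interest', 'limits', 'torch.no_grad',
            'model', 'evaluation', 'mode', 'memory', 'code', 'inference']
# Count occurrences and keep the most frequent entries
top = Counter(keywords).most_common(3)
print(top)
```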
Sentence Similarity Comparison:
import spacy

# similarity() relies on word vectors, which only the larger models include
nlp = spacy.load("en_core_web_lg")
doc1 = nlp(u'the person wear red T-shirt')
doc2 = nlp(u'this person is walking')
doc3 = nlp(u'the boy wear red T-shirt')
print(doc1.similarity(doc2))
print(doc1.similarity(doc3))
print(doc2.similarity(doc3))
Output:
0.7003971105290047
0.9671912343259517
0.6121211244876517
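By default, `Doc.similarity` computes the cosine similarity of the two documents' averaged word vectors. A pure-Python sketch of that computation, using illustrative toy vectors rather than real spaCy vectors:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy 3-d "document vectors" (in spaCy these are averages of word vectors)
vec1 = [0.2, 0.8, 0.1]
vec2 = [0.25, 0.75, 0.05]
print(cosine(vec1, vec2))
```

Identical directions give 1.0 and orthogonal vectors give 0.0, which is why the near-paraphrase pair above scores about 0.97 while the unrelated pair scores lower.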