For a single sentence, a simple approach is to call split():
a = 'This is an apple. Do you like apple?'
b = a.split()
print(b) # ['This', 'is', 'an', 'apple.', 'Do', 'you', 'like', 'apple?']
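The split looks reasonable, but punctuation stays attached to the words ('apple.' and 'apple?'). A regular expression can split the sentence instead, using as the separator any run of characters that are not letters, digits, or underscores (\W+):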
import re
a = 'This is an apple. Do you like apple?'
b = re.split(r'\W+', a)
print(b) # ['This', 'is', 'an', 'apple', 'Do', 'you', 'like', 'apple', '']
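The resulting word list no longer contains punctuation, but it ends with an empty string (because the sentence itself ends with a separator), and the words are still in mixed case. Filtering out empty strings and lowercasing each word gives: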
import re
a = 'This is an apple. Do you like apple?'
b = re.split(r'\W+', a)
c = [word.lower() for word in b if len(word) > 0]
print(c) # ['this', 'is', 'an', 'apple', 'do', 'you', 'like', 'apple']
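As an alternative sketch, the same result can be obtained by matching the words directly instead of splitting on the separators. re.findall with \w+ returns only the non-empty runs of word characters, so no empty strings need to be filtered out. The helper name tokenize is made up for this illustration and is not part of the original example.

```python
import re

def tokenize(sentence):
    """Hypothetical helper: return the lowercase words of a sentence."""
    # \w+ matches runs of letters, digits, and underscores, so the
    # punctuation and whitespace between words are simply never matched.
    return re.findall(r'\w+', sentence.lower())

a = 'This is an apple. Do you like apple?'
print(tokenize(a))  # ['this', 'is', 'an', 'apple', 'do', 'you', 'like', 'apple']
```

Note that both \W+ and \w+ treat an apostrophe as a separator, so a word such as "don't" would come out as two tokens, 'don' and 't'.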