In Python 3, opening a file in binary mode gives you bytes results. Iterating over a bytes object yields integers in the range 0 to 255 (inclusive), not characters. From the bytes documentation:

"While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256"
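
This behaviour is easy to check on a small bytes literal. The sample value below is arbitrary and chosen only for illustration:

```python
# Iterating over bytes yields integers in range(256), not 1-character strings.
data = b"Hi\x00!"  # arbitrary sample value

values = list(data)
print(values)         # [72, 105, 0, 33]
print(type(data[0]))  # <class 'int'>

# Slicing, unlike indexing, still gives a bytes object.
print(data[0:1])      # b'H'
```

This is why a test such as `c in string.printable` never matches in binary mode: it compares an int against a str.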

Convert string.printable to a set of integer byte values and test each byte against that set:

    printable = {ord(c) for c in string.printable}

and

    if c in printable:

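Next, append to a bytearray() object to keep performance reasonable, and decode from ASCII to produce str results:

    import string

    printable = {ord(c) for c in string.printable}

    # Generator wrapper (name and signature assumed) so that `yield` is valid.
    def strings(filename, min):
        with open(filename, "rb") as f:
            result = bytearray()
            for c in f.read():
                if c in printable:
                    result.append(c)
                    continue
                if len(result) >= min:
                    yield result.decode('ASCII')
                result.clear()  # reset after every non-printable byte
            if len(result) >= min:  # catch result at EOF
                yield result.decode('ASCII')
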
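
To try the byte-by-byte approach end to end, here is a minimal self-contained sketch. The function name `strings`, the parameter name `min_length` (used instead of `min` to avoid shadowing the built-in), its default of 4, and the sample file contents are all assumptions made for this demonstration:

```python
import os
import string
import tempfile

PRINTABLE = {ord(c) for c in string.printable}


def strings(filename, min_length=4):
    """Yield runs of printable ASCII bytes at least min_length long."""
    with open(filename, "rb") as f:
        result = bytearray()
        for c in f.read():
            if c in PRINTABLE:
                result.append(c)
                continue
            if len(result) >= min_length:
                yield result.decode("ascii")
            result.clear()
        if len(result) >= min_length:  # run that ends at EOF
            yield result.decode("ascii")


# Write a small sample file (contents are made up for this demonstration).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00\x01hello\xffab\x02world!\x03")
    sample_path = tmp.name

found = list(strings(sample_path))
os.remove(sample_path)
print(found)  # ['hello', 'world!']
```

The run `ab` is dropped because it is shorter than the assumed minimum length.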
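Rather than iterating over the bytes one by one, you could instead split on anything that is not printable:

    import re
    import string

    nonprintable = re.compile(b'[^%s]+' % re.escape(string.printable.encode('ascii')))

    # Generator wrapper (name assumed) so that `yield` is valid.
    def strings(filename):
        with open(filename, "rb") as f:
            for result in nonprintable.split(f.read()):
                if result:
                    yield result.decode('ASCII')
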
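
The splitting step can be seen in isolation on an in-memory bytes value. In this sketch the sample input and the `min_length` threshold are assumptions added for illustration. The length filter is not part of the split itself:

```python
import re
import string

# Character class matching one or more bytes that are NOT printable ASCII.
nonprintable = re.compile(b"[^%s]+" % re.escape(string.printable.encode("ascii")))

data = b"\x00\x01hello\xffab\x02world!\x03"  # made-up sample input

parts = nonprintable.split(data)
print(parts)  # [b'', b'hello', b'ab', b'world!', b'']

# Hypothetical threshold, mirroring a minimum-length filter.
min_length = 4
found = [p.decode("ascii") for p in parts if len(p) >= min_length]
print(found)  # ['hello', 'world!']
```

Formatting a bytes pattern with `%` requires Python 3.5 or later.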
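I'd explore reading the file in chunks rather than in one go; don't try to fit a large file into memory all at once:

    # Generator wrapper (name assumed) so that `yield` is valid;
    # `nonprintable` is the compiled pattern defined above.
    def strings(filename):
        with open(filename, "rb") as f:
            buffer = b''
            for chunk in iter(lambda: f.read(2048), b''):
                splitresult = nonprintable.split(buffer + chunk)
                buffer = splitresult.pop()
                for part in splitresult:  # renamed from `string` to avoid shadowing the module
                    if part:
                        yield part.decode('ascii')
            if buffer:
                yield buffer.decode('ascii')
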
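The buffer carries any incomplete word over from one chunk to the next; re.split() produces empty values at the start and end if the input starts or ends with a non-printable character, respectively.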
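
The buffer handling is easiest to see with a deliberately tiny chunk size. The following is a sketch only: the function name `strings_chunked`, its `stream` and `chunk_size` parameters, and the sample data are assumptions, and an in-memory `io.BytesIO` stands in for a real file:

```python
import io
import re
import string

nonprintable = re.compile(b"[^%s]+" % re.escape(string.printable.encode("ascii")))


def strings_chunked(stream, chunk_size=2048):
    """Yield printable ASCII runs from a binary stream, read chunk by chunk."""
    buffer = b""
    for chunk in iter(lambda: stream.read(chunk_size), b""):
        pieces = nonprintable.split(buffer + chunk)
        # The last piece may continue in the next chunk, so hold it back.
        buffer = pieces.pop()
        for piece in pieces:
            if piece:
                yield piece.decode("ascii")
    if buffer:
        yield buffer.decode("ascii")


data = b"\x00hello\x01world\x02"
# A 4-byte chunk size forces "hello" and "world" to straddle chunk boundaries.
found = list(strings_chunked(io.BytesIO(data), chunk_size=4))
print(found)  # ['hello', 'world']
```

Both words come out whole even though neither fits inside a single 4-byte chunk, because the trailing piece of each split is held back until the next chunk arrives.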