Estimates state that 70%–85% of the world’s data is text (unstructured data). Most English and EU business data is formatted as byte-encoded text, MS Word, or Adobe PDF. [1]
Organizations display Adobe Portable Document Format (PDF) documents on the web. [2]
In this blog, I detail the following:
- Create a file path from a web file name or a local file name;
- Change a byte-encoded Gutenberg Project file into a text corpus;
- Change a PDF document into a text corpus;
- Segment continuous text into a corpus of word text.
Converting Popular Document Formats into Text
1. Create a local filepath from the web filename or local filename
The following function will take either a local file name or a remote file URL and return a file-like object.
#in file_to_text.py
--------------------------------------------
from io import StringIO, BytesIO
from typing import Any
import urllib.request

def file_or_url(pathfilename: str) -> Any:
    """
    Return a file-like object given a local file path or a URL.
    Args:
        pathfilename: local file path or remote URL.
    Returns:
        file-like object instance.
    """
    try:
        fp = open(pathfilename, mode="rb")  # local file
    except OSError:
        # not a local file; fetch the URL into an in-memory buffer
        url_text = urllib.request.urlopen(pathfilename).read()
        fp = BytesIO(url_text)
    return fp
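A quick usage sketch (the arXiv URL is one of the links used later in this post; the local path tmp/inf_finite_NN.pdf is the file from the later PDF example and is assumed to exist):

from file_to_text import file_or_url

fp_local = file_or_url('tmp/inf_finite_NN.pdf')                     # opened directly in binary mode
fp_remote = file_or_url('https://arxiv.org/pdf/2008.05828v1.pdf')   # downloaded into a BytesIO buffer
print(fp_local.read(4), fp_remote.read(4))                          # PDF files start with b'%PDF'
fp_local.close()
fp_remote.close()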
2. Change a Unicode byte-encoded file into a Python Unicode string
You will often encounter text blob downloads in 8-bit Unicode (UTF-8) format, common for the Romance languages. You need to convert the 8-bit Unicode bytes into Python Unicode strings.
#in file_to_text.py
--------------------------------------------
def unicode_8_to_text(text: bytes) -> str:
    return text.decode("utf-8", "replace")
--------------------------------------------
import urllib.request
from file_to_text import unicode_8_to_text

text_l = 250
text_url = r'http://www.gutenberg.org/files/74/74-0.txt'
gutenberg_text = urllib.request.urlopen(text_url).read()
%time gutenberg_text = unicode_8_to_text(gutenberg_text)
print('{}: size: {:g} \n {} \n'.format(0, len(gutenberg_text), gutenberg_text[:text_l]))
output =>
CPU times: user 502 µs, sys: 0 ns, total: 502 µs
Wall time: 510 µs
0: size: 421927
The Project Gutenberg EBook of The Adventures of Tom Sawyer, Complete by
Mark Twain (Samuel Clemens)
This eBook is for the use of anyone anywhere at no cost and with almost
no restrictions whatsoever. You may copy it, give it away or re-use
it under the terms of the Project Gutenberg License included with this
eBook or online at www.guten
The result is that text.decode('utf-8') can format a million characters into a Python string in about 1/1000th of a second, a rate that far exceeds our production requirements.
3. Change a PDF document into a text corpus.
“Changing a PDF document into a text corpus" is one of the most troublesome and common tasks I do for NLP text pre-processing.
#in file_to_text.py
--------------------------------------------
from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter
from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfpage import PDFPage

def PDF_to_text(pathfilename: str) -> str:
    """
    Change PDF format to text.
    Args:
        pathfilename: local file path or remote URL of the PDF.
    Returns:
        extracted text as a Python string.
    """
    fp = file_or_url(pathfilename)
    rsrcmgr = PDFResourceManager()
    retstr = StringIO()
    laparams = LAParams()
    device = TextConverter(rsrcmgr, retstr, laparams=laparams)
    interpreter = PDFPageInterpreter(rsrcmgr, device)
    password = ""
    maxpages = 0
    caching = True
    pagenos = set()
    for page in PDFPage.get_pages(
        fp,
        pagenos,
        maxpages=maxpages,
        password=password,
        caching=caching,
        check_extractable=True,
    ):
        interpreter.process_page(page)
    text = retstr.getvalue()
    fp.close()
    device.close()
    retstr.close()
    return text
-------------------------------------------------------
arvix_list = ['https://arxiv.org/pdf/2008.05828v1.pdf'
              , 'https://arxiv.org/pdf/2008.05981v1.pdf'
              , 'https://arxiv.org/pdf/2008.06043v1.pdf'
              , 'tmp/inf_finite_NN.pdf']

for n, f in enumerate(arvix_list):
    %time pdf_text = PDF_to_text(f).replace('\n', ' ')
    print('{}: size: {:g} \n {} \n'.format(n, len(pdf_text), pdf_text[:text_l]))
output =>
CPU times: user 1.89 s, sys: 8.88 ms, total: 1.9 s
Wall time: 2.53 s
0: size: 42522
On the Importance of Local Information in Transformer Based Models Madhura Pande, Aakriti Budhraja, Preksha Nema Pratyush Kumar, Mitesh M. Khapra Department of Computer Science and Engineering Robert Bosch Centre for Data Science and AI (RBC-DSAI) Indian Institute of Technology Madras, Chennai, India {mpande,abudhra,preksha,pratyush,miteshk}@
CPU times: user 1.65 s, sys: 8.04 ms, total: 1.66 s
Wall time: 2.33 s
1: size: 30586
ANAND,WANG,LOOG,VANGEMERT:BLACKMAGICINDEEPLEARNING1BlackMagicinDeepLearning:HowHumanSkillImpactsNetworkTrainingKanavAnand1anandkanav92@gmail.comZiqiWang1z.wang-8@tudelft.nlMarcoLoog12M.Loog@tudelft.nlJanvanGemert1j.c.vangemert@tudelft.nl1DelftUniversityofTechnology,Delft,TheNetherlands2UniversityofCopenhagenCopenhagen,DenmarkAbstractHowdoesauser’sp
CPU times: user 4.82 s, sys: 46.3 ms, total: 4.87 s
Wall time: 6.53 s
2: size: 57204
0 2 0 2 g u A 3 1 ] G L . s c [ 1 v 3 4 0 6 0 . 8 0 0 2 : v i X r a Offline Meta-Reinforcement Learning with Advantage Weighting Eric Mitchell1, Rafael Rafailov1, Xue Bin Peng2, Sergey Levine2, Chelsea Finn1 1 Stanford University, 2 UC Berkeley em7@stanford.edu Abstract Massive datasets have proven critical to successfully
CPU times: user 12.2 s, sys: 36.1 ms, total: 12.3 s
Wall time: 12.3 s
3: size: 89633
0 2 0 2 l u J 1 3 ] G L . s c [ 1 v 1 0 8 5 1 . 7 0 0 2 : v i X r a Finite Versus Infinite Neural Networks: an Empirical Study Jaehoon Lee Samuel S. Schoenholz∗ Jeffrey Pennington∗ Ben Adlam†∗ Lechao Xiao∗ Roman Novak∗ Jascha Sohl-Dickstein {jaehlee, schsam, jpennin, adlam, xlc, romann, jaschasd}@google.com Google Brain
On this hardware configuration, “Converting a PDF file into a Python string” requires roughly 150 seconds per million characters. That is not fast enough for an interactive web production application.
You may want to stage the formatting in the background, for example as in the sketch below.
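A minimal sketch of staging the conversion in the background with Python’s standard concurrent.futures; the batch of file names is hypothetical, and PDF_to_text is the function defined above:

from concurrent.futures import ProcessPoolExecutor
from file_to_text import PDF_to_text

# Hypothetical batch of PDFs queued for conversion.
pdf_batch = ['tmp/report_1.pdf', 'tmp/report_2.pdf']

with ProcessPoolExecutor(max_workers=2) as pool:
    # Submit the slow conversions to worker processes so the
    # request-handling code is not blocked while they run.
    futures = {pool.submit(PDF_to_text, f): f for f in pdf_batch}
    for future, name in futures.items():
        print(name, len(future.result()))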
4. Segment continuous text into a corpus of word text
When we read https://arxiv.org/pdf/2008.05981v1.pdf, it came back as continuous text with no separation characters. Using the wordsegment package, we separate the continuous string into words.
from wordsegment import load, clean, segment

load()  # load the wordsegment unigram/bigram data before calling segment()
%time words = segment(pdf_text)
print('size: {:g} \n'.format(len(words)))
' '.join(words)[:text_l*4]
output =>
CPU times: user 1min 43s, sys: 1.31 s, total: 1min 44s
Wall time: 1min 44s
size: 5005
'an and wang loog van gemert blackmagic in deep learning 1 blackmagic in deep learning how human skill impacts network training kanavanand1anandkanav92g mailcom ziqiwang1zwang8tudelftnl marco loog12mloogtudelftnl jan van gemert 1jcvangemerttudelftnl1 delft university of technology delft the netherlands 2 university of copenhagen copenhagen denmark abstract how does a users prior experience with deep learning impact accuracy we present an initial study based on 31 participants with different levels of experience their task is to perform hyper parameter optimization for a given deep learning architecture the results show a strong positive correlation between the participants experience and then al performance they additionally indicate that an experienced participant nds better solutions using fewer resources on average the data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyperparameters our study investigates the subjective human factor in comparisons of state of the art results and scientic reproducibility in deep learning 1 introduction the popularity of deep learning in various elds such as image recognition 919speech1130 bioinformatics 2124questionanswering3 etc stems from the seemingly favorable tradeoff between the recognition accuracy and their optimization burden lecunetal20 attribute their success t'
You will notice that wordsegment accomplishes a fairly accurate separation into words. There are some errors, or words that we don’t want, that NLP text pre-processing will clear away (see the illustrative sketch below).
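For example, the arXiv sidebar watermark appears in outputs 2 and 3 above as scattered single characters ("0 2 0 2 g u A ..."). A purely illustrative cleanup pass, not from the original post, could drop stray one-character tokens from the segmented word list:

# Illustrative cleanup: drop isolated single-character tokens such as the
# arXiv watermark residue, while keeping the legitimate one-letter words 'a' and 'i'.
def drop_stray_chars(words):
    keep = {'a', 'i'}
    return [w for w in words if len(w) > 1 or w.lower() in keep]

words_clean = drop_stray_chars(words)
print(len(words), '->', len(words_clean))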
The Apache2-licensed wordsegment package is slow. It is barely adequate in production for small documents of fewer than a thousand words. Can we find a faster way to segment?
4b. Segment continuous text into a corpus of word text
There seems to be a faster method to "Segment continuous text into Corpus of word text."
As discussed in the SymSpell blog post:
SymSpell is 100x–1000x faster. Wow!
Note (ed. 8/24/2020): Wolf Garbe deserves credit for pointing out:
The benchmark results (100x–1000x faster) given in the SymSpell blog post are referring solely to spelling correction, not to word segmentation. In that post SymSpell was compared to other spelling correction algorithms, not to word segmentation algorithms. — Wolf Garbe, 8/23/2020
and
Also, there is an easier way to call a C# library from Python: https://stackoverflow.com/questions/7367976/calling-a-c-sharp-library-from-python — Wolf Garbe, 8/23/2020
Note (ed. 8/24/2020): I am going to try Garbe's C# implementation. If I do not get the same results (and probably even if I do), I will try a Cython port and see if I can fit it into spacy as a pipeline element. I will let you know my results.
However, it is implemented in C#, and I am not going down the infinite ratholes of:
- Converting all my NLP into C#. Not a viable option.
- Calling C# from Python. I talked to two engineering managers of Python groups. They have Python-to-C# capability, but it involves:
- Translating to VB-vanilla;
- Manual intervention, and the translation must pass tests for reproducibility;
- Translating from VB-vanilla to C;
- Manual intervention, and the translation must pass tests for reproducibility.
Instead, we work with a port to Python. Here is a version:
import pkg_resources
from symspellpy import SymSpell

def segment_into_words(input_term):
    # maximum edit distance per dictionary precalculation
    max_edit_distance_dictionary = 0
    prefix_length = 7
    # create object
    sym_spell = SymSpell(max_edit_distance_dictionary, prefix_length)
    # load dictionary
    dictionary_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_dictionary_en_82_765.txt")
    bigram_path = pkg_resources.resource_filename(
        "symspellpy", "frequency_bigramdictionary_en_243_342.txt")
    # term_index is the column of the term and count_index is the
    # column of the term frequency
    if not sym_spell.load_dictionary(dictionary_path, term_index=0,
                                     count_index=1):
        print("Dictionary file not found")
        return
    if not sym_spell.load_bigram_dictionary(bigram_path, term_index=0,
                                            count_index=2):
        print("Bigram dictionary file not found")
        return
    result = sym_spell.word_segmentation(input_term)
    return result.corrected_string

%time long_s = segment_into_words(pdf_text)
print('size: {:g} {}'.format(len(long_s), long_s[:text_l*4]))
output =>
CPU times: user 20.4 s, sys: 59.9 ms, total: 20.4 s
Wall time: 20.4 s
size: 36585 ANAND,WANG,LOOG,VANGEMER T:BLACKMAGICINDEEPLEARNING1B lack MagicinDeepL earning :HowHu man S kill Imp acts Net work T raining Ka nav An and 1 an and kana v92@g mail . com ZiqiWang1z. wang -8@tu delft .nlM arc oLoog12M.Loog@tu delft .nlJ an van Gemert1j.c. vang emert@tu delft .nl1D elf tUniversityofTechn ology ,D elf t,TheN ether lands 2UniversityofC open hagen C open hagen ,Den mark Abs tract How does a user ’s prior experience with deep learning impact accuracy ?We present an initial study based on 31 participants with different levels of experience .T heir task is to perform hyper parameter optimization for a given deep learning architecture .T here -s ult s show a strong positive correlation between the participant ’s experience and the fin al performance .T hey additionally indicate that an experienced participant finds better sol u-t ions using fewer resources on average .T he data suggests furthermore that participants with no prior experience follow random strategies in their pursuit of optimal hyper pa-ra meters .Our study investigates the subjective human factor in comparisons of state of the art results and sci entific reproducibility in deep learning .1Intro duct ion T he popularity of deep learning in various fi eld s such as image recognition [9,19], speech [11,30], bio informatics [21,24], question answering [3] etc . stems from the seemingly fav or able trade - off b
SymSpellpy, implemented in Python, is about 5x faster. We are not seeing the 100x–1000x speedup.
I guess that the SymSpell-C# benchmarks are comparing against different segmentation algorithms implemented in Python.
Perhaps the speedup comes from C#, a compiled, statically typed language. Since C# and C are about the same computing speed, we could expect a C# implementation to be 100x–1000x faster than a Python implementation.
Note: There is a spacy pipeline implementation, spacy_symspell, which directly calls SymSpellpy. I recommend you don’t use spacy_symspell. Spacy generates tokens as the first step of the pipeline, and those tokens are immutable. spacy_symspell generates new text by segmenting continuous text; it cannot generate new tokens, because spacy has already generated them. A spacy pipeline works on a token sequence, not on a stream of text. One would have to spin off a changed version of spacy. Why bother? Instead, segment continuous text into a corpus of word text, then correct embedded whitespace within words and hyphenated words in the text, do any other raw cleaning you want, and only then feed the raw text to spacy.
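A minimal sketch of that recommended order: segment first, do raw cleanup on the plain string, and only then hand the text to spacy. The cleanup regex and the model name are illustrative assumptions, not from the original post; PDF_to_text and segment_into_words are the functions defined above:

import re
import spacy

raw_text = PDF_to_text('tmp/inf_finite_NN.pdf')    # 1. PDF -> continuous text
segmented = segment_into_words(raw_text)           # 2. SymSpellpy word segmentation
cleaned = re.sub(r'\s*-\s*', '-', segmented)       # 3. illustrative cleanup: rejoin split hyphenated words
nlp = spacy.load("en_core_web_sm")                 # 4. only now let spacy tokenize
doc = nlp(cleaned)
print(len(doc))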
I show spacy_symspell below. Again, my advice is not to use it.
import spacy
from spacy_symspell import SpellingCorrector

def segment_into_words(input_term):
    nlp = spacy.load("en_core_web_lg", disable=["tagger", "parser"])
    corrector = SpellingCorrector()
    nlp.add_pipe(corrector)
Conclusion
In future blogs, I will detail many more common and uncommon fast text pre-processing methods. Also, I will show the expected speedup from moving SymSpellpy to Cython.
There will be many more formats and APIs you need to support in the world of “Changing X format into a text corpus.”
I detailed two of the more common document formats, PDF and the Gutenberg Project format. Also, I gave two NLP utility functions, segment_into_words and file_or_url.
I hope you learned something and can use some of the code in this blog.
If you have some format conversions, or better yet a package of them, let me know.
Originally published at: https://towardsdatascience.com/natural-language-processing-in-production-converting-pdf-and-gutenberg-document-formats-into-text-9e7cd3046b33