Disambiguation Through Tokenization

Torture the data, and it will confess to anything

Disambiguation, as defined in the vocabulary.com dictionary, refers to the removal of ambiguity by making something clear and narrowing down its meaning. Whilst data disambiguation is not an easy task, it is essential for all language processing and is directly correlated with perceived data quality.

In data integration, the goal is to consolidate data from disparate sources into a single homogenized set and, ultimately, to provide users with consistent access to and delivery of data. However, real-world data is messy, inconsistent, and ambiguous. As a result, it needs to be processed and massaged to maximize its effectiveness. Disambiguation provides a framework in which integrating data and transforming it into a consistent format is scalable. This transformation generally involves creating a common vocabulary and a framework to extract valuable information from the noise and added complexity.

Data In The Recruitment World

Data in the recruitment world consists of a mixture of structured and unstructured information of varying lengths, e.g. job descriptions, curricula vitae, and cover letters. At Beamery, we have implemented various mechanisms to extract and structure information from these textual sources, e.g. role titles, skills, experience descriptions, and company names. Among those, role titles stand out as the key that draws the lines around the skills and the knowledge the individual possesses.

Taking advantage of the rich background information around a role title is useful; however, role titles are also among the harder pieces of information to deal with in the recruitment space. They are ever-changing, prone to typos, and can carry additional knowledge such as seniority or location on top of the main phrase defining the work. Moreover, you are likely to encounter synonyms expressing the same set of skills and experience, pointing at the same semantic object. For example, “software engineer”, “software developer” and “software ninja” can be used interchangeably, all representing the same underlying experience. In this work, we make a distinction between disambiguation and similarity. The following process does not make judgments about the similarities between role titles. The aim is to disambiguate the textual information without losing vital details that shape the expectations from the role title.

The Problem

Before we go into detail, it is important to paint a clear picture of the problem. We have sampled around 10 million distinct anonymous contacts from Beamery’s database. These “contacts” are individuals who have been in contact with our client companies about a job position. Role titles are among the data stored about these individuals. When we created the list of role titles from this sample, we were shocked by the staggering count of 7 million distinct role titles. Our clients hire from different nationalities and countries, so we would expect the inclusion of different languages to compound the final number. However, this still indicates a clear data cleanliness problem that may be caused by typos or other technical errors in CV parsing or third-party integration systems.

When thinking about reducing the complexity of the role title space, an intuitive approach would be to map the raw role titles to a curated version, preferably from a taxonomy of role titles. This way we would reduce the diversity in 7 million role titles and translate them to their counterparts in a known space. There are public taxonomies available, such as the efforts by ESCO and O*NET. It seems enticing at first to take advantage of work that is incredibly costly in both experts’ time and money. Yet mapping to a taxonomy poses a non-trivial search problem, and it isn’t really the first step in dealing with role titles. It became clear to us that we needed to deconstruct, clean, and understand the building blocks before moving into mapping or other downstream efforts.

The Solution

The response came as a multi-step disambiguation framework that features a role title vocabulary. There are two main parts to the process: cleaning and feature extraction. Cleaning includes basic preprocessing, spelling correction, and token removal by discarding out-of-vocabulary words. The outcome of the cleaning step is the “disambiguated title”. Feature extraction is focused on transforming a string into a set of features, such as seniority levels and a set of phrases.

Vocabulary

At the core of the disambiguation process lies the role title vocabulary. Vocabularies allow us to define the boundaries of an entity by offering a set of acceptable building blocks. In this case, words are the building blocks for role titles. Common words such as manager, director, senior, and specialist are the usual suspects. We also have words that represent expertise. For example, “scientist” is a broad term that defines a set of skills. “Research scientist” indicates that the role is likely to belong in academia, whereas a “data scientist” is likely to work in a commercial company. All of these words heavily modify the meaning of the role title. But how would we decide whether a word belongs in a role title and whether its presence enriches the role title’s meaning? We have chosen commonality as the acceptance criterion for the vocabulary. Selecting a count threshold for acceptance of a word is a balancing act. A lower count threshold means a bigger vocabulary: the resulting disambiguation will be less aggressive, but it will have higher coverage. A higher threshold means a smaller vocabulary: coverage will suffer, but disambiguation can be more robust. It also carries the risk of inviting many false positives.

We have created a role title vocabulary of 8,330 words with a threshold of 100. It covers 96% of the role titles that occur at least 5 times among 28 million instances. The main assumption here is that any word that is not a part of the vocabulary is either noise, a typo (if we failed to correct it), or so obscure that downstream models/processes cannot make sense of it.
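To make the threshold trade-off concrete, here is a minimal sketch of building a vocabulary by word count and measuring its coverage. It is an illustration and not our production code: the function names, the toy titles, and the definition of coverage (a title counts as covered when every one of its words is in the vocabulary) are all assumptions made for this example.

```python
from collections import Counter


def build_vocabulary(titles, min_count):
    """Keep every word that occurs at least `min_count` times across all titles."""
    word_counts = Counter(
        word for title in titles for word in title.lower().split()
    )
    return {word for word, count in word_counts.items() if count >= min_count}


def coverage(titles, vocabulary):
    """Fraction of title instances whose words are all in the vocabulary.

    This definition of coverage is an assumption made for the sketch.
    """
    if not titles:
        return 0.0
    covered = sum(
        all(word in vocabulary for word in title.lower().split())
        for title in titles
    )
    return covered / len(titles)


# Toy data (hypothetical); real thresholds are far higher than 1 or 2.
titles = [
    "senior data scientist",
    "data scientist",
    "data scientist",
    "senior software engineer",
    "software engineer",
    "software ninja",
]

small_vocab = build_vocabulary(titles, min_count=2)  # drops the rare word "ninja"
large_vocab = build_vocabulary(titles, min_count=1)  # keeps every word

print(sorted(small_vocab), coverage(titles, small_vocab))
print(sorted(large_vocab), coverage(titles, large_vocab))
```

On this toy data, raising the threshold from 1 to 2 removes “ninja” from the vocabulary and lowers coverage, which is the balancing act described above in miniature.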

Disambiguated Title

An important step in the cleaning process is creating a “fingerprint” of the role title. After preprocessing, spelling correction, and token removal, we create an ID of the role title from the remaining tokens. We have been influenced by the simple but efficient approach taken for clustering in OpenRefine, which simply orders the words alphabetically and keeps only the unique ones. Using the “fingerprint”, we can group role titles sharing the same ID and assign a “disambiguated title” by taking the most common version.

Figure: role titles sharing “Senior Data Scientist” as their “disambiguated title”
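The fingerprinting idea can be sketched in a few lines. This is a simplified illustration that assumes the titles have already been preprocessed, spell-corrected, and stripped of out-of-vocabulary tokens; the function names and sample titles are hypothetical.

```python
from collections import Counter, defaultdict


def fingerprint(tokens):
    """Order the unique tokens alphabetically and join them into an ID."""
    return " ".join(sorted(set(tokens)))


def disambiguated_titles(cleaned_titles):
    """Map each fingerprint to the most common cleaned title that produced it."""
    groups = defaultdict(Counter)
    for title in cleaned_titles:
        groups[fingerprint(title.split())][title] += 1
    return {fp: counts.most_common(1)[0][0] for fp, counts in groups.items()}


# Hypothetical cleaned titles: three word orders of the same role.
cleaned = [
    "senior data scientist",
    "senior data scientist",
    "data scientist senior",
    "scientist data senior",
    "software engineer",
]

mapping = disambiguated_titles(cleaned)
for title in cleaned:
    print(title, "->", mapping[fingerprint(title.split())])
```

All three word orders share the fingerprint “data scientist senior”, so each is assigned the most frequent variant, “senior data scientist”, as its disambiguated title.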

Phrase Detection

Role titles have different components and can include several roles listed together, as in “Founder and Chief Executive Officer”. If we were to aim for a one-to-one mapping to a curated list of role titles, we would always lose vital information or be forced to make an arbitrary decision about which role title to map to.

Instead of mapping raw role titles to a curated set of role titles, we have chosen to deconstruct the role title into a list of words and phrases. Understanding the vocabulary will enable us to work with any role title spanned by these entities. This is very similar to the way a human would understand any written text. Instead of mapping every possible sentence to an instance in our memory, we learn the words and the grammar. This is the type of understanding that will be enabled by our process.

To capture the phrases, we have trained an n-gram language model to qualify phrase candidates found in the role titles. The example role title below holds pockets of information that are valuable in assessing the expertise of the individual. We could map this role to a “Vice President”, but in the process we would lose most of the context. Instead, we qualify the phrases and keep them as a set of tags. The features extracted from this role title (seniority, phrases, and the disambiguated title) together capture the complete context, but we could use any of the features individually depending on the nature of the downstream solution.
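As a rough illustration of qualifying phrase candidates with an n-gram model, the sketch below uses a plain bigram model with relative-frequency estimates. The model order, the thresholds `min_prob` and `min_count`, the function names, and the toy corpus are all assumptions made for the example; the model we actually trained is not described at this level of detail here.

```python
from collections import Counter


def train_bigram_model(titles):
    """Count single words and adjacent word pairs over a list of cleaned titles."""
    unigrams, bigrams = Counter(), Counter()
    for title in titles:
        words = title.split()
        unigrams.update(words)
        bigrams.update(zip(words, words[1:]))
    return unigrams, bigrams


def bigram_probability(unigrams, bigrams, first, second):
    """Relative-frequency estimate of P(second | first): count(first second) / count(first)."""
    if unigrams[first] == 0:
        return 0.0
    return bigrams[(first, second)] / unigrams[first]


def detect_phrases(title, unigrams, bigrams, min_prob=0.5, min_count=2):
    """Return adjacent word pairs that are both frequent and likely enough.

    `min_prob` and `min_count` are hypothetical thresholds for this sketch.
    """
    words = title.split()
    return [
        f"{first} {second}"
        for first, second in zip(words, words[1:])
        if bigrams[(first, second)] >= min_count
        and bigram_probability(unigrams, bigrams, first, second) >= min_prob
    ]


# Toy training corpus (hypothetical).
corpus = [
    "vice president product marketing",
    "vice president sales",
    "vice president engineering",
    "product marketing manager",
    "product marketing lead",
    "sales manager",
]
unigrams, bigrams = train_bigram_model(corpus)

print(detect_phrases("vice president product marketing", unigrams, bigrams))
```

On this corpus, “vice president” and “product marketing” qualify as phrases, while the boundary pair “president product” is rejected because it appears only once. Keeping both phrases as tags preserves context that a one-to-one mapping to “Vice President” would discard.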

Role Title Disambiguation Process

The diagram below shows the different steps in the disambiguation process.

Start with the raw role title

Detect the language of the role title

Every other step in the process depends on the language of the role title. Spelling correction, the vocabulary of words and phrases, and the seniority dictionary all change with the language. Therefore, detecting the language at the beginning is vital.

Preprocessing includes dealing with non-Latin characters, expanding acronyms, removing punctuation, and removing extra whitespace.

The spelling correction step allows us to catch any spelling errors before we look for the vocabulary words in the role title.

Token removal works on the assumption that words that are left out of the vocabulary are irrelevant to the granularity we are aiming for.

Fingerprinting creates a unique representation of the role title that will serve as an ID.

The disambiguated title is a clean version of the role title, and it is shared by all role titles with the same fingerprint.

Seniority detection is the process of looking for seniority terms inside the role title. If found, seniority is extracted as a new feature from the role title.

The phrase detection step uses an n-gram language model to assign probabilities to word groups and qualify them by their ability to represent the role title.

After completing the steps given above, we end up with a list of features for a role title. Instead of establishing a one-to-one mapping, we have created a structure that captures the information available in the role title. Using this clean data, we can move to a structure in which extracted entities are points in a vector space where we can infer relationships between them, getting us closer to achieving the “understanding” that we are looking for.
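The cleaning and seniority steps above can be strung together in a short sketch. Everything named here is hypothetical: the tiny `VOCABULARY`, `SENIORITY_TERMS`, and `ACRONYMS` tables stand in for the real dictionaries, the title is assumed to be English (language detection is skipped), non-Latin character handling is omitted, and the standard library's `difflib` is swapped in for the actual spelling correction module.

```python
import difflib
import re

# Hypothetical, tiny dictionaries standing in for the real language-specific ones.
VOCABULARY = {"senior", "junior", "data", "scientist", "software", "engineer"}
SENIORITY_TERMS = {"junior", "senior", "lead", "principal"}
ACRONYMS = {"sr": "senior", "jr": "junior"}


def preprocess(raw_title):
    """Lowercase, replace punctuation with spaces, split on whitespace, expand acronyms."""
    text = re.sub(r"[^\w\s]", " ", raw_title.lower())
    return [ACRONYMS.get(token, token) for token in text.split()]


def correct_spelling(tokens, vocabulary, cutoff=0.8):
    """Replace each out-of-vocabulary token with its closest vocabulary word, if any."""
    candidates = sorted(vocabulary)
    corrected = []
    for token in tokens:
        if token in vocabulary:
            corrected.append(token)
            continue
        matches = difflib.get_close_matches(token, candidates, n=1, cutoff=cutoff)
        corrected.append(matches[0] if matches else token)
    return corrected


def extract_features(raw_title):
    """Run the cleaning steps, then derive the fingerprint and seniority features."""
    tokens = correct_spelling(preprocess(raw_title), VOCABULARY)
    tokens = [t for t in tokens if t in VOCABULARY]  # token removal
    return {
        "tokens": tokens,
        "fingerprint": " ".join(sorted(set(tokens))),
        "seniority": [t for t in tokens if t in SENIORITY_TERMS],
    }


features = extract_features("Sr. Data Scientst @ London!!")
print(features)
```

In this example the acronym “Sr” is expanded, the typo “Scientst” is corrected, and “London” is discarded as out of vocabulary, leaving the tokens senior, data, and scientist with “senior” extracted as the seniority feature.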

Evaluation

Conceiving and creating such a process is valuable; however, adoption and consistent value creation depend on proving value and improvement. Such a process, with many rules of varying complexity, requires a lot of care. Identifying edge cases and failings is very important. The stakeholders need to acknowledge that this is an iterative process. The failed edge cases become inputs for learning, and over time the output quality will increase.

For this reason, we have created an internal evaluation UI. This allowed us to recruit a group of testers and ask them to go through a set of role titles and check the outcome of the modules at each step. The feedback exposes the potential shortcomings of the process but, equally importantly, it gives us a ground truth set for quantitative testing. This way, we can measure the performance of individual modules every time we release a new version.
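As an illustration of how such a ground truth set can be used, the sketch below compares the accuracy of one module across two releases. The metric (exact-match accuracy), the sample labels, and the release outputs are hypothetical and chosen only to show the mechanics.

```python
def module_accuracy(predictions, ground_truth):
    """Share of ground-truth titles for which the module output matches the label."""
    if not ground_truth:
        raise ValueError("ground truth set is empty")
    correct = sum(
        title in predictions and predictions[title] == label
        for title, label in ground_truth.items()
    )
    return correct / len(ground_truth)


# Hypothetical tester labels for the seniority module (None means "no seniority").
ground_truth = {
    "sr data scientst": "senior",
    "software ninja": None,
    "jr engineer": "junior",
}

# Hypothetical outputs of two releases of the same module.
release_a = {"sr data scientst": "senior", "software ninja": None, "jr engineer": None}
release_b = {"sr data scientst": "senior", "software ninja": None, "jr engineer": "junior"}

print(module_accuracy(release_a, ground_truth))
print(module_accuracy(release_b, ground_truth))
```

Running the same check against every module on each release makes regressions visible before they reach downstream processes.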

Further Work

We are aware that rule-based modules cannot always capture the complexity of human-level tasks. However, the current performance gives us a competent baseline to beat. Depending on the criticality of the tasks and the performance expectations, we will prioritize the improvement efforts.

A possible addition to the process is the detection of different entities such as locations, company names, and software/technology names. Currently, we are choosing to discard location and company names from the role title vocabulary. However, with enough training data, we should be able to train a performant named entity recognition model that can recognize seniority terms as well.

The spelling correction module depends on a select dictionary of words and phrases. However, we could get better results by leveraging a multilingual dataset of spelling mistakes. If we fail to assign the correct language to the role title, we end up incorrectly “correcting” foreign-language words, because they are not present in the vocabulary.

Another important improvement area is phrase detection. We have started with a baseline model to score the phrases; however, language modeling is one of the most popular research areas. As long as we have a large enough dataset of role titles, we can allow deep networks to learn the grammar dictating the structure of the role title and the semantic world behind it. Yet, context and progression of role titles can be harder to capture.

Conclusion

This is our response to the diversity and noise in the role title space. Iterative improvement is at the heart of this process, and such an effort needs time to mature. This is one of the earlier steps in the data journey to build a platform on which to base future efforts. It certainly helps to contextualize the data problem within the problems of the business in order to keep it prioritized and supported. In our experience, the business strongly supports the disambiguation efforts as long as the context and the nature of the solution are well communicated.

We capture the gist of the solution under the umbrella term “disambiguation”; however, each module responds to a different problem. It is very important that every module gets enough attention in evaluation and improvement. In the end, a chain is only as strong as its weakest link.

We hope that our story in creating a disambiguation process can inspire you to address similar problems. In the series that follows we will continue with posts regarding our progress and provide in-depth information on the individual modules.