How to Identify Media Bias: Descriptive Language Understanding to Identify Potential Bias in Text

GumGum can help bring change by using our Natural Language Processing technology to shed light on potential bias that websites may have in their content. The ideas and techniques shared in this blog are a result of the GumGum Hackathon Project: Verity E-Quality (Aditya Ramesh, Erica Nishimura, Ishan Shrivastava, Lane Schechter and Trung Do).

In this blog, we look at how we can build upon GumGum’s existing product offerings to understand Gender Representation in a website’s content. We aren’t saying that one publisher is more biased than another; we are merely providing awareness of the representation as it exists. With Natural Language Processing, we can compare the descriptive language used around Males and Females to provide this awareness.

In order to facilitate meaningful change, we need to be aware and mindful of where that change is needed. — Lane Schechter, Product Manager, GumGum Inc.

GumGum’s Product Offerings

Before we move ahead to understand how we build upon the existing product offerings, let us first take a brief look at them. GumGum’s Verity Product does a complete contextual analysis of a publisher’s webpage. Some of the key offerings of this product are:

  • Contextual Classification & Targeting: This feature identifies and scores publisher’s content (webpages) for contextual classification based on standard IAB Content Taxonomy v1.0 and v2.0. Some of those categories are “Sports”, “Food & Drinks”, “Automotive”, “Medical Health” etc. Going forward, we will refer to them as IAB verticals.

  • Brand Safety & Suitability: This feature flags and rates brand safety threats based on GumGum’s proprietary threat classification taxonomy and in compliance with The 4A’s Advertising Assurance Brand Safety Framework.

  • Named Entity Recognition (NER): This feature identifies and extracts any mention of a named entity in the publisher’s content. A named entity could be any mention of a ‘Person’, ‘Location’ or ‘Organization’.

  • Sentiment Analysis: This feature analyzes the attitudes, opinions and emotions expressed online to provide the most nuanced brand safety and contextual insights.

Here is one way we can provide Descriptive Language Understanding Associated with Gender. We can use the Named Entity Recognition (NER) feature to extract the names of “Person” entities, which can then be used to identify the gender of the person being talked about. We can also use the Sentiment Analysis feature to extract the sentiment of the sentences in which Males and Females are mentioned. We can use all of this information to understand the descriptive language used around Males and Females (more on how to do this in the next section) and compare it across the different IAB verticals extracted using our Contextual Classification feature.

Approach for Descriptive Language Understanding Associated with Gender

Fig 1: Flowchart describing the approach for Descriptive Language Understanding Associated with Gender

We start by running a Domain Specific Query on our NLP Databases to extract URLs for the given publisher. We then use the Named Entity Recognition feature of Verity to filter out pages that do not contain any “Person” Named Entity. From the remaining pages, we extract all “Person Names” and the Sentences in which those “Person Names” occur. As a future step, we can also perform coreference resolution to extract more sentences in which the “Persons” are referred to by pronouns.
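The filtering and extraction step can be sketched roughly as follows. This is a minimal illustration, not Verity’s actual pipeline: the page dictionaries, the `person_names` field, and the regex sentence splitter are all assumptions made so the example is self-contained.

```python
import re

def sentences_mentioning(text, person_names):
    """Return (name, sentence) pairs for every sentence that mentions a name."""
    # Naive splitter (assumption): break after '.', '!' or '?' followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    pairs = []
    for sentence in sentences:
        for name in person_names:
            if name in sentence:
                pairs.append((name, sentence))
    return pairs

def collect_mentions(pages):
    """Drop pages without any Person entity, then gather name/sentence pairs."""
    mentions = []
    for page in pages:
        names = page["person_names"]  # hypothetical field holding NER output
        if not names:
            continue  # page has no "Person" entity, so it is filtered out
        mentions.extend(sentences_mentioning(page["text"], names))
    return mentions

# Toy pages standing in for the result of the domain-specific query.
pages = [
    {"url": "https://example.com/a",
     "text": "Maria Lopez won the final. The crowd cheered.",
     "person_names": ["Maria Lopez"]},
    {"url": "https://example.com/b",
     "text": "Rain is expected on Friday.",
     "person_names": []},
]
mentions = collect_mentions(pages)
```

A production version would use a proper sentence segmenter and entity offsets rather than substring matching, which can miss partial mentions such as a surname on its own.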

We then use the “Person Names” to detect the gender of each person using an open source package called Gender Guesser. We also use the “Sentences” to extract the sentiment of each Sentence with our own FastText based Sentiment Classification model. This model, trained on our publisher data, classifies a sentence as Negative, Neutral or Positive.
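As a rough sketch of the gender step, the helper below maps a full name to a gender bucket through a pluggable first-name guesser. The `gender_bucket` function, the toy label table, and the choice to fold “mostly_male” and “mostly_female” into the two main buckets are assumptions for illustration; the post does not say how those cases were handled.

```python
def gender_bucket(full_name, guess):
    """Map a person name to "male", "female" or None (undetermined).

    `guess` is any callable that takes a first name and returns a label such as
    "male", "female", "mostly_male", "mostly_female", "andy" or "unknown".
    """
    parts = full_name.strip().split()
    first_name = parts[0] if parts else ""
    label = guess(first_name)
    # Assumption: treat the "mostly_*" labels as belonging to the main bucket.
    if label in ("male", "mostly_male"):
        return "male"
    if label in ("female", "mostly_female"):
        return "female"
    return None  # androgynous, unknown, or any other label

# Stand-in guesser so the sketch runs without the package installed.
# With the library itself: guess = gender_guesser.detector.Detector().get_gender
toy_labels = {"James": "male", "Maria": "female", "Robin": "mostly_male", "Alex": "andy"}

def guess(first_name):
    return toy_labels.get(first_name, "unknown")

buckets = {name: gender_bucket(name, guess)
           for name in ["Maria Lopez", "James Park", "Robin Hale", "Alex Kim"]}
```

Names that end up as `None` are best left out of the comparison rather than forced into either group.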

We also use the “Person Names” and the Sentences they occur in to extract the Adjectives used in the surrounding context of a given person. To achieve this, we use spaCy’s part-of-speech tagger and extract the adjectives that appear in proximity to a mention of a person’s name. Consider the example given below:

Fig 2
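The proximity rule can be sketched on already-tagged tokens. The example assumes each token arrives as a `(text, pos)` pair, as one would get from spaCy’s `token.text` and `token.pos_`, where adjectives carry the tag `"ADJ"`. The window size is a hypothetical parameter; the post does not state the distance that was actually used.

```python
def adjectives_near(tagged_tokens, name_start, name_end, window=5):
    """Return lowercased adjectives within `window` tokens of a name span.

    tagged_tokens: list of (text, pos) pairs for one sentence.
    The name occupies token indices [name_start, name_end).
    """
    lo = max(0, name_start - window)
    hi = min(len(tagged_tokens), name_end + window)
    found = []
    for i in range(lo, hi):
        if name_start <= i < name_end:
            continue  # skip the tokens of the name itself
        text, pos = tagged_tokens[i]
        if pos == "ADJ":
            found.append(text.lower())
    return found

# Hand-tagged toy sentence (assumption: tags follow the Universal POS scheme).
sentence = [
    ("The", "DET"), ("brilliant", "ADJ"), ("Maria", "PROPN"), ("Lopez", "PROPN"),
    ("gave", "VERB"), ("a", "DET"), ("confident", "ADJ"), ("speech", "NOUN"),
    (".", "PUNCT"),
]
near = adjectives_near(sentence, 2, 4, window=3)
```

A token window is the simplest notion of proximity. It can also pick up adjectives that describe something else in the sentence, so a dependency-based rule may be more precise.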

We use all this information to create a Word Cloud of the Adjectives used around each Gender and Sentiment pair, both across the entire content and for specific IAB verticals.
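A word cloud is driven by word frequencies, so the aggregation behind it might look like the sketch below. The record layout `(gender, sentiment, adjectives)` is an assumption, and the negative-sentiment adjective is invented toy data.

```python
from collections import Counter, defaultdict

def adjective_counts(records):
    """Count adjectives separately for each (gender, sentiment) pair."""
    counts = defaultdict(Counter)
    for gender, sentiment, adjectives in records:
        counts[(gender, sentiment)].update(adjectives)
    return counts

# Toy records: one entry per sentence that mentions a person.
records = [
    ("male", "positive", ["proud", "perfect"]),
    ("female", "positive", ["beautiful", "sweet"]),
    ("female", "positive", ["beautiful"]),
    ("male", "negative", ["guilty"]),
]
counts = adjective_counts(records)
```

Each `Counter` is a word-to-frequency mapping that can be handed to a word cloud renderer. Adding the IAB vertical to the key would give the per-vertical view.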

For example, consider the following four word clouds that we got based on the Adjectives used around Males and Females in a Positive and Negative context extracted from a Publisher’s content:

Word cloud of Adjectives used around Males with Negative Sentiment
Word cloud of Adjectives used around Females with Negative Sentiment

Nothing stereotypical stands out here. Similarly negative adjectives are used around Males and Females alike.

Word cloud of Adjectives used around Males with Positive Sentiment
Word cloud of Adjectives used around Females with Positive Sentiment

What we see here is that more Intellectual Type Adjectives are used around Males, while more Appearance Type Adjectives are used around Females.

This becomes even clearer if we look at the most frequent Adjectives used around ONLY Males or ONLY Females. We do this by taking the top 15 adjectives for each gender, keeping only the Adjectives that the two genders do not have in common, and comparing them across the Positive and Negative Context.
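The “top 15, then keep what is not shared” rule can be sketched as below. The function name and the toy counts are assumptions; the example uses `top_n=3` only to keep the data small.

```python
from collections import Counter

def exclusive_top_adjectives(male_counts, female_counts, top_n=15):
    """Return (male_only, female_only) lists drawn from each gender's top_n adjectives."""
    male_top = [word for word, _ in male_counts.most_common(top_n)]
    female_top = [word for word, _ in female_counts.most_common(top_n)]
    male_set, female_set = set(male_top), set(female_top)
    # Keep frequency order, but drop anything that is also in the other top list.
    male_only = [word for word in male_top if word not in female_set]
    female_only = [word for word in female_top if word not in male_set]
    return male_only, female_only

# Toy counts for a single sentiment context.
male_counts = Counter({"great": 5, "proud": 4, "perfect": 3, "sweet": 1})
female_counts = Counter({"great": 6, "beautiful": 4, "sweet": 3, "proud": 1})
male_only, female_only = exclusive_top_adjectives(male_counts, female_counts, top_n=3)
```

Note that an adjective counts as shared only if it is in both top lists. In the toy data “proud” is used around both genders, but it stays Male only because it falls outside the Female top 3.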

Fig 7: Most frequent Adjectives used for only Males or only Females, based on the top 15 Adjectives for each Gender in each Sentiment Context

Here we can clearly see that in the Negative context, the most frequent Adjectives used around only Males and only Females can be considered equally negative. In the Positive context, that is not the case. Around Males, we see adjectives like “Proud”, “Sized”, “Perfect”, “Fantastic” etc., while around Females we see adjectives like “Beautiful”, “Healthy”, “Amazing”, “Sweet”, “Supporting”, “Lucky” etc. This is suggestive of more Intellectual Type Adjectives being used around Males and more Appearance Type Adjectives being used around Females.

This sort of analysis of the descriptive language used around different Genders in different Sentiment Contexts can really help in understanding what sort of Bias, if any, is present in a publisher’s content. But how can we quantify this? For this we introduce a Context Based Similarity Score.

Context Based Similarity Score

The idea here is to find a way to compute a single score that shows the degree of similarity between the most frequent adjectives used around only Males and only Females. To achieve this we make use of the famous Transformer based Deep Learning model: BERT by Google Research.

Besides being excellent at a variety of NLP tasks and breaking State of the Art results on them, BERT is also great at providing Contextualized Word Vector Representations (Embeddings). That is, BERT doesn’t provide a single, constant representation of a word; rather, it looks at the context in which the word is used in a sentence and produces a context-sensitive representation of that word. This is particularly useful because it captures more information than static representations such as Word2Vec or GloVe, which assign one vector to a word regardless of context. A famous example is that BERT will provide different representations for the word “Bank” depending on whether the context is a river bank or a financial bank. Therefore, to extract a word representation from BERT, you need to send in a sentence in which the word is used to get a Contextualized Word Vector Representation. (Apart from the original BERT paper, there are visual explanations of Transformers and BERT that offer a more intuitive way of understanding them.)

Therefore, along with the most frequent Male only and Female only Adjectives, we also extract the sentences in which these Adjectives are used. We send these sentences into BERT to extract, for each Adjective, a Contextualized Vector Representation of length 768 based on the context in which that Adjective was used.

We use these context-rich representations to compute a Context Based Similarity Score between the Male only Adjectives and the Female only Adjectives used in a Positive or a Negative context. We take the mean of the contextual representations of all Male only Adjectives, and likewise of all Female only Adjectives, to get one averaged representation for each group. We then take the cosine similarity between the two averaged vectors to compute the Context Based Similarity Score, as shown in the figure below:

Fig 8: Calculating the Context Based Similarity Score from Contextualized Word Vector Representations of the Adjectives used around only Males and around only Females.
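The computation in the figure, averaging each group of vectors and then taking their cosine similarity, can be sketched in plain Python. The three-dimensional toy vectors stand in for the 768-dimensional BERT representations, and the function names are assumptions for this example.

```python
import math

def mean_vector(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def context_similarity_score(male_vectors, female_vectors):
    """Cosine similarity between the averaged male-only and female-only vectors."""
    return cosine_similarity(mean_vector(male_vectors), mean_vector(female_vectors))

# Toy contextual vectors, one per adjective occurrence.
male_vectors = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
female_vectors = [[1.0, 1.0, 0.0], [1.0, 1.0, 0.0]]
score = context_similarity_score(male_vectors, female_vectors)
```

In this toy case the two averaged vectors point in the same direction, so the score comes out close to 1.0. Vectors at right angles to each other would give 0.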

This score is calculated for a given sentiment and a given IAB vertical.

The higher this score, the more balanced the Adjectives used around the two genders are in the context of a given sentiment and a given IAB vertical.

Let us look at the Context Based Similarity score in action:

Fig 9: The Context Based Similarity Score based on the most frequent Adjectives used around only Males and only Females in each Sentiment Context

Comparing the two scores, we get a higher score for Negative sentiment, where similar kinds of Adjectives (equally negative in this case) were used around Males and Females. On the other hand, we get a lower score for Positive sentiment, where we did see some form of Bias, with Intellectual Type Adjectives being used around Males and Appearance Type Adjectives being used around Females.

Conclusion

In this blog we saw how we can analyze the Descriptive Language used around Males and Females. We examined the insights from such an analysis and saw how they can point us to where change might be required. We looked at how GumGum can leverage Product Offerings like Contextual Classification and Named Entity Recognition from its wide range of features and build upon them to quantify the degree of similarity in the descriptive language used around Males and Females. As part of our future work, we can work on identifying Race mentions in a piece of text and extend this work to understand the Descriptive Language used around different Races.

About Me: I graduated with a Master’s in Computer Science from ASU. I am an NLP Scientist at GumGum. I am interested in applying Machine Learning/Deep Learning to provide some structure to the unstructured data that surrounds us.

Translated from: https://medium.com/gumgum-tech/descriptive-language-understanding-to-identify-potential-bias-in-text-89936fefbae7
