Using Differential Privacy to Harness Big Data and Preserve Privacy

The modern world runs on “big data,” the massive data sets used by governments, firms, and academic researchers to conduct analyses, unearth patterns, and drive decision-making. When it comes to data analysis, bigger can be better: The more high-quality data is incorporated, the more robust the analysis will be. Large-scale data analysis is becoming increasingly powerful thanks to machine learning and has a wide range of benefits, such as informing public-health research, reducing traffic, and identifying systemic discrimination in loan applications.

But there’s a downside to big data, as it requires aggregating vast amounts of potentially sensitive personal information. Whether amassing medical records, scraping social media profiles, or tracking banking and credit card transactions, data scientists risk jeopardizing the privacy of the individuals whose records they collect. And once data is stored on a server, it may be stolen, shared, or compromised. “Improper disclosure of such data can have adverse consequences for a data subject’s private information, or even lead to civil liability or bodily harm,” explains data scientist An Nguyen, in his article, “Understanding Differential Privacy.”

Computer scientists have worked for years to try to find ways to make data more private, but even if they attempt to de-identify data — for example, by removing individuals’ names or other parts of a data set — it is often possible for others to “connect the dots” and piece together information from multiple sources to determine a supposedly anonymous individual’s identity (via a so-called re-identification or linkage attack).

Fortunately, in recent years, computer scientists have developed a promising new approach to privacy-preserving data analysis known as “differential privacy” that allows researchers to unearth the patterns within a data set — and derive observations about the population as a whole — while obscuring the information about each individual’s records.

The solution: differential privacy

Differential privacy (also known as “epsilon indistinguishability”) was first developed in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. In a 2016 lecture, Dwork defined differential privacy as being achieved when “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.”

How is this possible? Differential privacy works by adding a pre-determined amount of randomness, or “noise,” into a computation performed on a data set. As an example, imagine if five people submit “yes” or “no” about a question on a survey, but before their responses are accepted, they have to flip a coin. If they flip heads, they answer the question honestly. But if they flip tails, they have to re-flip the coin, and if the second toss is tails, they respond “yes,” and if heads, they respond “no” — regardless of their actual answer to the question.

As a result of this process, we would expect a quarter of respondents (0.5 x 0.5 — those who flip tails and tails) to answer “yes,” even if their actual answer would have been “no”. With sufficient data, the researcher would be able to factor in this probability and still determine the overall population’s response to the original question, but every individual in the data set would be able to plausibly deny that their actual response was included.
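
The coin-flip survey above can be simulated in a few lines. The sketch below is illustrative only: the function names, the population size, and the 30% "yes" rate are assumptions made for the example, not part of the original article. It relies on the fact that, under this protocol, the chance of reporting "yes" is 0.5 × p + 0.25, where p is the true share of "yes" answers, so the researcher can solve for p.

```python
import random

def randomized_response(truth: bool, rng: random.Random) -> bool:
    """Report an answer using the two-coin protocol from the survey example."""
    if rng.random() < 0.5:        # first flip is heads: answer honestly
        return truth
    return rng.random() < 0.5     # first flip is tails: second flip decides the answer

def estimate_true_yes_rate(reports: list[bool]) -> float:
    """Invert P(report yes) = 0.5 * p + 0.25 to recover the true rate p."""
    observed = sum(reports) / len(reports)
    return 2.0 * (observed - 0.25)

rng = random.Random(0)
# Hypothetical population: 30% of 100,000 people would honestly answer "yes".
true_answers = [i < 30_000 for i in range(100_000)]
reports = [randomized_response(answer, rng) for answer in true_answers]
print(f"estimated true yes rate: {estimate_true_yes_rate(reports):.3f}")
```

With only five respondents the estimate would be far too noisy to be useful, which is why the approach depends on having sufficient data.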

Of course, researchers don’t actually use coin tosses and instead rely on algorithms that, based on a pre-determined probability, similarly alter some of the responses in the data set. The more responses are changed by the algorithm, the more the privacy is preserved for the individuals in the data set. The trade-off, of course, is that as more “noise” is added to the computation — that is, as a greater percentage of responses are changed — then the accuracy of the data analysis goes down.

When Dwork and her colleagues first defined differential privacy, they used the Greek symbol ε, or epsilon, to mathematically define the privacy loss associated with the release of data from a data set. This value defines just how much differential privacy is provided by a particular algorithm: The lower the value of epsilon, the more each individual’s privacy is protected. The higher the epsilon, the more accurate the data analysis — but the less privacy is preserved.
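
For readers who want the formal statement, the standard definition can be written as follows. The notation is an assumption introduced here for illustration: M is a randomized algorithm, D and D' are any two data sets that differ in one individual's record, and S is any set of possible outputs. M is ε-differentially private if:

```latex
\Pr[\,M(D) \in S\,] \;\le\; e^{\varepsilon} \cdot \Pr[\,M(D') \in S\,]
```

Applied to the coin-flip survey described earlier, a respondent whose true answer is "yes" reports "yes" with probability 0.75, and one whose true answer is "no" reports "yes" with probability 0.25:

```latex
\frac{\Pr[\text{report yes} \mid \text{truth yes}]}{\Pr[\text{report yes} \mid \text{truth no}]}
  = \frac{0.75}{0.25} = 3 = e^{\ln 3},
\qquad \varepsilon = \ln 3 \approx 1.10
```

The same ratio holds for "no" reports, so that protocol corresponds to an epsilon of about 1.1.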

Figure: A lower epsilon value results in greater privacy, but lower accuracy.

When the data is perturbed (i.e. the “noise” is added) while still on a user’s device, it’s known as local differential privacy. When the noise is added to a computation after the data has been collected, it’s called central differential privacy. With this latter method, the more you query a data set, the more information risks being leaked about the individual records. Therefore, the central model requires constantly searching for new sources of data to maintain high levels of privacy.
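
To make the central model concrete, here is a minimal sketch of a counting query answered with Laplace noise, together with a simple budget that tracks how much epsilon has been spent across queries. The Laplace mechanism is a standard technique, but it is not named in the original article, and the class name `BudgetedCounter`, its methods, and the sample records are all hypothetical. The sketch also uses ordinary floating-point sampling, which production libraries avoid because it can leak information.

```python
import random

def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample zero-mean Laplace noise as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)

class BudgetedCounter:
    """Answers counting queries with Laplace noise until a total epsilon budget is spent."""

    def __init__(self, records, total_epsilon, seed=None):
        self._records = list(records)
        self._remaining = total_epsilon
        self._rng = random.Random(seed)

    def noisy_count(self, predicate, epsilon):
        if epsilon <= 0:
            raise ValueError("epsilon must be positive")
        if epsilon > self._remaining:
            raise RuntimeError("privacy budget exhausted")
        # Each answered query spends part of the budget (sequential composition).
        self._remaining -= epsilon
        true_count = sum(1 for record in self._records if predicate(record))
        # Adding or removing one record changes a count by at most 1,
        # so the sensitivity is 1 and the noise scale is 1 / epsilon.
        return true_count + laplace_noise(1.0 / epsilon, self._rng)

ages = [23, 35, 41, 52, 67, 29, 44]  # hypothetical records
counter = BudgetedCounter(ages, total_epsilon=1.0, seed=0)
print(counter.noisy_count(lambda age: age >= 40, epsilon=0.5))
print(counter.noisy_count(lambda age: age < 30, epsilon=0.5))
# A third query would raise RuntimeError because the budget of 1.0 is spent.
```

Refusing further queries once the budget is gone is one way to handle the leakage that accumulates with repeated querying; smaller per-query epsilons stretch the budget further at the cost of noisier answers.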

Either way, a key goal of differential privacy is to ensure that the results of a given query will not be affected by the presence (or absence) of a single record. Differential privacy also makes data less attractive to would-be attackers and can help prevent them from connecting personal data from multiple platforms.

Differential privacy in practice

Differential privacy has already gained widespread adoption by governments, firms, and researchers. It is already being used for “disclosure avoidance” by the U.S. census, for example, and Apple uses differential privacy to analyze user data ranging from emoji suggestions to Safari crashes. Google has even released an open-source version of a differential privacy library used in many of the company’s core products.

Using a concept known as “elastic sensitivity” developed in recent years by researchers at UC Berkeley, differential privacy is being extended into real-world SQL queries. The ride-sharing service Uber adopted this approach to study everything from traffic patterns to drivers’ earnings, all while protecting users’ privacy. By incorporating elastic sensitivity into a system that requires massive amounts of user data to connect riders with drivers, the company can help protect its users from a snoop.

Consider, for example, how implementing elastic sensitivity could protect a high-profile Uber user, such as Ivanka Trump. As Andy Greenberg wrote in Wired: “If an Uber business analyst asks how many people are currently hailing cars in midtown Manhattan, perhaps to check whether the supply matches the demand, and Ivanka Trump happens to be requesting an Uber at that moment, the answer wouldn’t reveal much about her in particular. But if a prying analyst starts asking the same question about the block surrounding Trump Tower, for instance, Uber’s elastic sensitivity would add a certain amount of randomness to the result to mask whether Ivanka, specifically, might be leaving the building at that time.”

Still, for all its benefits, most organizations are not yet using differential privacy. It requires large data sets, it is computationally intensive, and organizations may lack the resources or personnel to deploy it. They also may not want to reveal how much private information they’re using — and potentially leaking.

Another concern is that organizations that use differential privacy may be overstating how much privacy they’re providing. A firm may claim to use differential privacy, but in practice could use such a high epsilon value that the actual privacy provided would be limited.
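
A rough way to see why a very high epsilon offers little protection is to ask how often a yes/no survey answer would be reported truthfully under a randomized-response scheme tuned to a given epsilon. The sketch below assumes the standard binary scheme, in which the true answer is reported with probability e^ε / (1 + e^ε); the function name is hypothetical.

```python
import math

def truthful_probability(epsilon: float) -> float:
    """Probability of reporting the true answer in binary randomized response at this epsilon."""
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

for eps in (0.1, 1.0, math.log(3), 10.0):
    print(f"epsilon={eps:5.2f}  P(report is true)={truthful_probability(eps):.5f}")
```

At an epsilon of ln 3 the probability is 0.75, matching the coin-flip survey described earlier. At an epsilon of 10, almost every reported answer is the true one, so respondents retain almost no plausible deniability even though the scheme is still technically differentially private.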

To address whether differential privacy is being properly deployed, Dwork, together with UC Berkeley researchers Nitin Kohli and Deirdre Mulligan, has proposed the creation of an “Epsilon Registry” to encourage companies to be more transparent. “Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community,” they wrote in the Journal of Privacy and Confidentiality. “To serve these purposes, we propose the creation of the Epsilon Registry, a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.”

As a final note, organizations should not rely on differential privacy alone, but should use it as one defense in a broader arsenal, alongside other measures such as encryption and access control. Organizations should disclose the sources of data they’re using for their analysis, along with the steps they’re taking to protect that data. Combining such practices with differential privacy at low epsilon values will go a long way toward realizing the benefits of “big data” while reducing the leakage of sensitive personal data.

This article was cross-posted by Brookings TechStream. The video was animated by Annalise Kamegawa. The Center for Long-Term Cybersecurity would like to thank Nitin Kohli, PhD student in the UC Berkeley School of Information, and Paul Laskowski, Assistant Adjunct Professor in the UC Berkeley School of Information, for providing their expertise to review this video and article.

Translated from: https://medium.com/cltc-bulletin/using-differential-privacy-to-harness-big-data-and-preserve-privacy-349d84799862
