Using Differential Privacy to Harness Big Data and Preserve Privacy
The modern world runs on “big data,” the massive data sets used by governments, firms, and academic researchers to conduct analyses, unearth patterns, and drive decision-making. When it comes to data analysis, bigger can be better: The more high-quality data is incorporated, the more robust the analysis will be. Large-scale data analysis is becoming increasingly powerful thanks to machine learning and has a wide range of benefits, such as informing public-health research, reducing traffic, and identifying systemic discrimination in loan applications.
But there’s a downside to big data, as it requires aggregating vast amounts of potentially sensitive personal information. Whether amassing medical records, scraping social media profiles, or tracking banking and credit card transactions, data scientists risk jeopardizing the privacy of the individuals whose records they collect. And once data is stored on a server, it may be stolen, shared, or compromised. “Improper disclosure of such data can have adverse consequences for a data subject’s private information, or even lead to civil liability or bodily harm,” explains data scientist An Nguyen, in his article, “Understanding Differential Privacy.”
Computer scientists have worked for years to try to find ways to make data more private, but even if they attempt to de-identify data — for example, by removing individuals’ names or other parts of a data set — it is often possible for others to “connect the dots” and piece together information from multiple sources to determine a supposedly anonymous individual’s identity (via a so-called re-identification or linkage attack).
Fortunately, in recent years, computer scientists have developed a promising new approach to privacy-preserving data analysis known as “differential privacy” that allows researchers to unearth the patterns within a data set — and derive observations about the population as a whole — while obscuring the information about each individual’s records.
The solution: differential privacy
Differential privacy (also known as “epsilon indistinguishability”) was first developed in 2006 by Cynthia Dwork, Frank McSherry, Kobbi Nissim and Adam Smith. In a 2016 lecture, Dwork defined differential privacy as being achieved when “the outcome of any analysis is essentially equally likely, independent of whether any individual joins, or refrains from joining, the dataset.”
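Stated formally (this is the standard mathematical definition from the differential privacy literature, included here for context), a randomized algorithm M satisfies ε-differential privacy if, for any two data sets D and D′ that differ in only one individual's record, and for every set of possible outcomes S:

```latex
\Pr[M(D) \in S] \;\le\; e^{\varepsilon} \cdot \Pr[M(D') \in S]
```

For small ε, e^ε ≈ 1 + ε, so the two probabilities are nearly equal: the outcome of the analysis is essentially the same whether or not any one person's record is included, just as Dwork describes.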
How is this possible? Differential privacy works by adding a pre-determined amount of randomness, or “noise,” into a computation performed on a data set. As an example, imagine if five people submit “yes” or “no” about a question on a survey, but before their responses are accepted, they have to flip a coin. If they flip heads, they answer the question honestly. But if they flip tails, they have to re-flip the coin, and if the second toss is tails, they respond “yes,” and if heads, they respond “no” — regardless of their actual answer to the question.
As a result of this process, we would expect a quarter of respondents (0.5 x 0.5 — those who flip tails and tails) to answer “yes,” even if their actual answer would have been “no”. With sufficient data, the researcher would be able to factor in this probability and still determine the overall population’s response to the original question, but every individual in the data set would be able to plausibly deny that their actual response was included.
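To make this concrete, here is a minimal Python simulation of the coin-flip protocol (known in the literature as randomized response), along with the debiasing step just described. The population size and the 30% true "yes" rate are invented for illustration:

```python
import random

def randomized_response(true_answer: bool) -> bool:
    """Apply the two-coin protocol: heads -> answer honestly;
    tails -> flip again and report 'yes' on tails, 'no' on heads."""
    if random.random() < 0.5:        # first flip: heads
        return true_answer           # respond honestly
    return random.random() < 0.5     # second flip: tails -> 'yes', heads -> 'no'

def estimate_true_yes_rate(reports: list) -> float:
    """Invert the known noise: P(report 'yes') = 0.5 * p_true + 0.25,
    so p_true = 2 * (observed rate) - 0.5."""
    observed = sum(reports) / len(reports)
    return min(1.0, max(0.0, 2 * observed - 0.5))

# Simulate 10,000 respondents, 30% of whom would truthfully answer "yes".
truths = [random.random() < 0.30 for _ in range(10_000)]
reports = [randomized_response(t) for t in truths]

print(f"Observed 'yes' rate: {sum(reports) / len(reports):.3f}")   # ~0.40
print(f"Debiased estimate:   {estimate_true_yes_rate(reports):.3f}")  # ~0.30
```

Each individual retains plausible deniability, since any single "yes" may be the product of the coin, yet the aggregate estimate converges on the true rate as the sample grows.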
Of course, researchers don’t actually use coin tosses; instead, they rely on algorithms that, based on a pre-determined probability, similarly alter some of the responses in the data set. The more responses the algorithm changes, the more privacy is preserved for the individuals in the data set. The trade-off is that as more “noise” is added to the computation — that is, as a greater percentage of responses are changed — the accuracy of the data analysis goes down.
When Dwork and her colleagues first defined differential privacy, they used the Greek symbol ε, or epsilon, to mathematically define the privacy loss associated with the release of data from a data set. This value defines just how much differential privacy is provided by a particular algorithm: The lower the value of epsilon, the more each individual’s privacy is protected. The higher the epsilon, the more accurate the data analysis — but the less privacy is preserved.
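A sketch of the classic Laplace mechanism shows how epsilon operates as this tuning knob. For a counting query, adding or removing one record changes the answer by at most 1 (its "sensitivity"), so Laplace noise with scale 1/epsilon suffices; the counts and epsilon values below are arbitrary examples:

```python
import numpy as np

def laplace_count(true_count: int, epsilon: float) -> float:
    """Release a count under epsilon-differential privacy by adding
    Laplace noise with scale = sensitivity / epsilon (sensitivity = 1
    for a counting query)."""
    return true_count + np.random.laplace(loc=0.0, scale=1.0 / epsilon)

true_count = 1000
for epsilon in (0.01, 0.1, 1.0):
    print(f"epsilon = {epsilon:>5}: noisy count = {laplace_count(true_count, epsilon):9.1f}")
# Lower epsilon -> larger noise -> stronger privacy, weaker accuracy.
```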
When the data is perturbed (i.e. the “noise” is added) while still on a user’s device, it’s known as local differential privacy. When the noise is added to a computation after the data has been collected, it’s called central differential privacy. With this latter method, the more you query a data set, the more information risks being leaked about the individual records. Therefore, the central model requires constantly searching for new sources of data to maintain high levels of privacy.
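The two models can be contrasted in a few lines of code. This toy comparison (assuming each record is a number between 0 and 1, so per-record sensitivity is 1) shows why the local model, which never trusts the server with raw values, pays for that protection with much noisier answers at the same epsilon:

```python
import numpy as np

def local_dp_sum(values, epsilon):
    """Local model: each value is perturbed on the user's device,
    so the server only ever sees noisy records."""
    return sum(v + np.random.laplace(scale=1.0 / epsilon) for v in values)

def central_dp_sum(values, epsilon):
    """Central model: a trusted curator holds the raw data and adds
    one dose of noise to the final aggregate."""
    return sum(values) + np.random.laplace(scale=1.0 / epsilon)

values = np.random.uniform(0, 1, size=10_000)
print(f"True sum:       {values.sum():10.1f}")
print(f"Central DP sum: {central_dp_sum(values, 1.0):10.1f}")  # close to the truth
print(f"Local DP sum:   {local_dp_sum(values, 1.0):10.1f}")    # far noisier
```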
Either way, a key goal of differential privacy is to ensure that the results of a given query will not be affected by the presence (or absence) of a single record. Differential privacy also makes data less attractive to would-be attackers and can help prevent them from connecting personal data from multiple platforms.
Differential privacy in practice
Differential privacy has already gained widespread adoption by governments, firms, and researchers. The U.S. Census Bureau, for example, uses it for “disclosure avoidance,” and Apple uses differential privacy to analyze user data ranging from emoji suggestions to Safari crashes. Google has even released an open-source version of a differential privacy library used in many of the company’s core products.
Using a concept known as “elastic sensitivity” developed in recent years by researchers at UC Berkeley, differential privacy is being extended into real-world SQL queries. The ride-sharing service Uber adopted this approach to study everything from traffic patterns to drivers’ earnings, all while protecting users’ privacy. By incorporating elastic sensitivity into a system that requires massive amounts of user data to connect riders with drivers, the company can help protect its users from a snoop.
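Elastic sensitivity itself works by analyzing a query's structure, including its joins, to bound how much any one user's data can change the result; the actual tooling built by Uber and the Berkeley researchers is not reproduced here. The toy sketch below shows only the general shape of the idea, intercepting a SQL aggregate and adding calibrated noise before release, with the sensitivity hard-coded to 1 (valid only for a simple per-user count) and with table and column names invented for illustration:

```python
import sqlite3
import numpy as np

# Hypothetical rides table standing in for a real ride-sharing database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rides (user_id INTEGER, pickup_zone TEXT)")
conn.executemany(
    "INSERT INTO rides VALUES (?, ?)",
    [(i, "midtown" if i % 3 else "other") for i in range(3000)],
)

def noisy_sql_count(query: str, epsilon: float, sensitivity: float = 1.0) -> float:
    """Run a SQL count, then add Laplace noise scaled to its sensitivity.
    Bounding the sensitivity automatically for arbitrary queries is the
    hard problem that elastic sensitivity addresses."""
    true_count = conn.execute(query).fetchone()[0]
    return true_count + np.random.laplace(scale=sensitivity / epsilon)

print(noisy_sql_count(
    "SELECT COUNT(DISTINCT user_id) FROM rides WHERE pickup_zone = 'midtown'",
    epsilon=1.0,
))
```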
Consider, for example, how implementing elastic sensitivity could protect a high-profile Uber user, such as Ivanka Trump. As Andy Greenberg wrote in Wired: “If an Uber business analyst asks how many people are currently hailing cars in midtown Manhattan — perhaps to check whether the supply matches the demand — and Ivanka Trump happens to be requesting an Uber at that moment, the answer wouldn’t reveal much about her in particular. But if a prying analyst starts asking the same question about the block surrounding Trump Tower, for instance, Uber’s elastic sensitivity would add a certain amount of randomness to the result to mask whether Ivanka, specifically, might be leaving the building at that time.”
Still, for all its benefits, most organizations are not yet using differential privacy. It requires large data sets, it is computationally intensive, and organizations may lack the resources or personnel to deploy it. They also may not want to reveal how much private information they’re using — and potentially leaking.
Another concern is that organizations that use differential privacy may be overstating how much privacy they’re providing. A firm may claim to use differential privacy, but in practice could use such a high epsilon value that the actual privacy provided would be limited.
To address whether differential privacy is being properly deployed, Dwork, together with UC Berkeley researchers Nitin Kohli and Deirdre Mulligan, has proposed the creation of an “Epsilon Registry” to encourage companies to be more transparent. “Given the importance of these implementation details there is a need for shared learning amongst the differential privacy community,” they wrote in the Journal of Privacy and Confidentiality. “To serve these purposes, we propose the creation of the Epsilon Registry — a publicly available communal body of knowledge about differential privacy implementations that can be used by various stakeholders to drive the identification and adoption of judicious differentially private implementations.”
As a final note, organizations should not rely on differential privacy alone, but rather should use it as just one defense in a broader arsenal, alongside other measures like encryption and access control. Organizations should disclose the sources of data they’re using for their analysis, along with the steps they’re taking to protect that data. Combining such practices with low-epsilon differential privacy will go a long way toward realizing the benefits of “big data” while reducing the leakage of sensitive personal data.
This article was cross-posted by Brookings TechStream. The video was animated by Annalise Kamegawa. The Center for Long-Term Cybersecurity would like to thank Nitin Kohli, PhD student in the UC Berkeley School of Information, and Paul Laskowski, Assistant Adjunct Professor in the UC Berkeley School of Information, for providing their expertise to review this video and article.
Originally published at: https://medium.com/cltc-bulletin/using-differential-privacy-to-harness-big-data-and-preserve-privacy-349d84799862