The Power of Network Analysis

The most common way to store data is in what we call relational form. Most systems get analyzed as collections of independent data points. It looks something like this:

Whether you’re a spreadsheet user or a machine learning master, you’re probably used to seeing your data that way: rows and columns representing different categories and metrics.

However, this approach makes it very difficult to capture information about the relationships that are fundamental to so much of our world. When we stop to think about some of the common systems around us — systems which we care about understanding, optimizing and predicting — we start to see how treating these systems as independent data points misses crucial information.

Trade networks

Economies are defined by relationships and transactions more than just individual players that operate independently.

A power grid

Infrastructure we use every day is highly connected. We have transportation systems linking cities and people, and communication systems linking electronic devices.

Gene interactions

In biology, life doesn’t emerge from cells/proteins/genes working separately, but from those components coming together and performing many interactions to make the cell alive. Even our thoughts are hidden and encoded in the connections and wiring between billions of neurons.

Social networks

And, of course, we have social networks. This has become a term to describe social media platforms, but to be more specific, it is the data underlying these platforms, recording friendships and followers, that is useful to model as a literal network.

We could keep going with endless examples. What do all of these systems have in common? Highly connected data. Essentially, anything that involves humans is highly connected. Our world isn’t just a collection of individuals isolated from everyone else, but a network of billions of members who are constantly interacting with each other. Therefore, data that describes these systems will be useful insofar as it captures those connections.

In short, behind many systems there is an intricate wiring diagram, a network, that defines the connections between the components.

Although the traditional relational data model has served many domains well, highly connected systems can never be fully modeled or used for prediction unless we understand the networks behind them.

Google’s PageRank Algorithm

To further understand the transformative potential of network analysis upon its introduction to a new domain, I think it’d be useful to explain its role on a platform that you likely use every day.

In the late 1990s and early 2000s, there were many search engines on the web. The internet was a vast, ever-evolving terrain whose users desperately needed navigation help. Many understood this need, and the field of search engines and directories was crowded.

Despite being a latecomer, Google managed to surpass the competition only a few years after its founding by Larry Page and Sergey Brin in 1998. What made Google different? It modeled the internet as a network.

The basic problem that search engines faced was this: how do you measure the relevance or importance of different pages in order to determine which results to show after a user searches? There was no obvious answer. Most search engines attempted to measure the importance of a website by analyzing the content of that website itself. Much as in an Excel spreadsheet, this meant rows and columns, where each row was a page and each column was a variable or metric describing the content on that page.

However, this approach is easy to game. If you want to look up how to bake cupcakes, for example, and the search engine you’re using decides which results to return based only on website content, I could create a cupcake-baking website in 30 minutes with all the right content to make your search engine deem my page “relevant.” But my site is unlikely to be the most relevant or highest quality for your needs. In the early days of the internet, when everyone was trying to cash in on this new medium, cupcake con-men abounded.

Page and Brin invented a different approach to search. They realized that they could dramatically improve the results they showed users if they first modeled the internet as a network of domains and pages that reference each other. More specifically, they developed an algorithm to detect the “role” or the “importance” of a node in a network, now called the PageRank algorithm. Once understood, the PageRank algorithm seems very simple, but it’s very powerful.

It looks like this. Given a collection of web pages, we can keep track of all of the links or references that pages make to each other. In our model, when one page references another, we can add these references as arrows, or “edges,” pointing from the first page to the second page. We can do this across all of the pages in our collection of interest, and Google did it across all of the pages on the internet. What we end up with might look something like this:

Some pages are referenced more than others.
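
The bookkeeping described above can be sketched in a few lines of Python. This is only an illustration: the page names and links are invented for the example, and a real crawler would collect them from actual pages. Each link is stored as a (source, target) pair, which is the "arrow" or edge in the model.

```python
from collections import Counter

# Hypothetical pages and links, invented for illustration only.
# Each pair (source, target) means "source links to target".
links = [
    ("blog", "recipes"),
    ("forum", "recipes"),
    ("news", "recipes"),
    ("recipes", "news"),
    ("blog", "news"),
    ("spam", "spam_mirror"),
]

# Adjacency list: for each page, the pages it points to.
out_links = {}
for source, target in links:
    out_links.setdefault(source, []).append(target)
    out_links.setdefault(target, [])  # make sure every page appears as a node

# In-degree: how many links point at each page.
in_degree = Counter(target for _, target in links)

# List pages from most referenced to least referenced.
for page in sorted(out_links, key=lambda p: -in_degree[p]):
    print(page, in_degree[page])
```

Counting incoming links is the crudest possible notion of "referenced more than others"; it treats every link as equally valuable, which is exactly the weakness a ranking algorithm has to address.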

As you can imagine, and can also see in the simple illustrated example, some web pages are referenced far more often than others. A web page that is authoritative and relevant, unlike one that a poser created 30 minutes ago, will be among the nodes referenced most often. Without going into the mathematical details of the actual algorithm, we can think of PageRank as essentially measuring the “importance” or “influence” of a page based on its role in the network. By crawling the web and recording the references that pages make to each other, Google was able to calculate exactly this measure of importance for each page, weed out the irrelevant ones, and subsequently return higher quality search results to its users.
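
For readers who do want a feel for the mechanics, here is a simplified sketch of the idea, not Google's production algorithm. It assumes the commonly cited damping factor of 0.85, a fixed number of iterations, and a tiny invented set of pages. Each page repeatedly passes its current score along its outgoing links, so a link from an important page counts for more than a link from an obscure one.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """Approximate PageRank scores by repeatedly redistributing rank.

    out_links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative assumptions.
    """
    pages = list(out_links)
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share of rank.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page, targets in out_links.items():
            if targets:
                # Split this page's rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for other in pages:
                    new_rank[other] += share
        rank = new_rank
    return rank


# Hypothetical mini-web, invented for illustration.
web = {
    "blog": ["recipes", "news"],
    "forum": ["recipes"],
    "news": ["recipes"],
    "recipes": ["news"],
    "spam": [],
}

scores = pagerank(web)
for page, score in sorted(scores.items(), key=lambda item: -item[1]):
    print(f"{page}: {score:.3f}")
```

In this toy example the heavily referenced "recipes" page ends up with the highest score, while the "spam" page that nobody links to stays near the baseline, no matter what content it contains.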

What else can we do with networks?

A description of the potential of networks could fill (and has filled) many books, and more recently this modeling approach has garnered much interest in machine learning, particularly deep learning. But all of these fancy applications still depend on the basic advantage of modeling your data as a network, much as Google grasped it: networks allow you to calculate entirely new metrics to describe and understand your data, metrics that you could never have calculated before. These metrics are many, and they are derived from various algorithms, like Google’s PageRank, that can be run once you model your data as a network.

Nodes highlighted in light blue are more connected/central.

There are various measures of centrality, similar to PageRank, and these centrality measures correspond to many concepts that we already think about and would be interested in measuring. In a social network, for example, the most central nodes might be the members who have many friends and whose opinions are highly regarded. The role of those authorities/influencers would become clear after modeling the relationships between people as a network.
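
One of the simplest centrality measures is degree centrality: the fraction of the other members that a given member is directly connected to. The sketch below computes it for a small friendship network; the names and friendships are made up for the example, and other centrality measures would weigh the same connections differently.

```python
# Hypothetical friendships, invented for illustration.
# Friendship is mutual, so each pair is an undirected edge.
friendships = [
    ("ana", "ben"),
    ("ana", "cho"),
    ("ana", "dev"),
    ("ben", "cho"),
    ("dev", "eli"),
]

# Build the set of friends for each person.
neighbors = {}
for a, b in friendships:
    neighbors.setdefault(a, set()).add(b)
    neighbors.setdefault(b, set()).add(a)

# Degree centrality: number of friends divided by the number of other people.
n = len(neighbors)
degree_centrality = {
    person: len(friends) / (n - 1) for person, friends in neighbors.items()
}

most_central = max(degree_centrality, key=degree_centrality.get)
print(most_central, degree_centrality[most_central])
```

Here "ana" is friends with three of the four other people, so she comes out as the most central member of this small network.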

Networks can be used to model movement within a network, which can lead to different measures of flow/directedness.

We can also measure directionality or flow within our networks, where the connections between components are essentially arrows. These might allow you to uncover patterns of movement. For example, you might be able to notice confusion or inefficiencies in transportation networks where there are a lot of cycles or zig-zags, and there are many ways to calculate that numerically given the relationships between the nodes in your network.
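
As one small example of such a calculation, the sketch below checks whether a directed network contains a cycle, using a depth-first search. The route maps are hypothetical, and detecting a cycle is only a starting point; a real analysis of inefficiency would go on to count or measure the cycles it finds.

```python
def has_cycle(routes):
    """Return True if the directed graph contains at least one cycle.

    routes maps each stop to the list of stops reachable in one hop.
    """
    UNVISITED, IN_PROGRESS, DONE = 0, 1, 2
    state = {stop: UNVISITED for stop in routes}

    def visit(stop):
        state[stop] = IN_PROGRESS
        for nxt in routes.get(stop, []):
            nxt_state = state.get(nxt, UNVISITED)
            if nxt_state == IN_PROGRESS:
                # We came back to a stop on the current path: a cycle.
                return True
            if nxt_state == UNVISITED and visit(nxt):
                return True
        state[stop] = DONE
        return False

    return any(state[stop] == UNVISITED and visit(stop) for stop in list(routes))


# Hypothetical route maps, invented for illustration.
loop_routes = {"A": ["B"], "B": ["C"], "C": ["A"], "D": ["A"]}
line_routes = {"A": ["B"], "B": ["C"], "C": []}

print(has_cycle(loop_routes))
print(has_cycle(line_routes))
```

The first map sends traffic from A to B to C and back to A, so the check reports a cycle; the second is a simple one-way line and does not.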

A network’s connectedness can be used to measure the robustness of a system or detect communities.

Another aspect that modelers are often interested in quantifying is a network’s connectedness. Again, there are several ways to do this, but it’s useful in many different applications.

For example, if you were modeling any type of infrastructure — transportation, trade, IT — you could use this as a measure of your infrastructure’s robustness. Or, as an e-commerce vendor, you could use this approach to find communities or clusters of customers.
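
A rough way to make both ideas concrete is to count connected components, the groups of nodes that can all reach each other. The sketch below, built on an invented miniature power grid, removes one node at a time and checks whether the network falls apart. It is a deliberately simple stand-in: dedicated robustness metrics and community detection algorithms are considerably more refined.

```python
from collections import deque


def connected_components(neighbors):
    """Return a list of node sets, one per connected component."""
    seen = set()
    components = []
    for start in neighbors:
        if start in seen:
            continue
        # Breadth-first search from an unvisited node collects one component.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for other in neighbors[node]:
                if other not in seen:
                    seen.add(other)
                    queue.append(other)
        components.append(component)
    return components


def without_node(neighbors, removed):
    """Return a copy of the network with one node and its links removed."""
    return {
        node: {other for other in others if other != removed}
        for node, others in neighbors.items()
        if node != removed
    }


# Hypothetical miniature grid, invented for illustration.
grid = {
    "plant": {"hub"},
    "hub": {"plant", "town_a", "town_b"},
    "town_a": {"hub", "town_b"},
    "town_b": {"hub", "town_a"},
}

for node in grid:
    pieces = len(connected_components(without_node(grid, node)))
    print(f"without {node}: {pieces} component(s)")
```

In this example, losing "hub" splits the grid into two pieces, while losing either town leaves everything else connected, which marks the hub as the weak point. The same component-finding step, applied to a network of customers, gives a first crude grouping into clusters.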

In Conclusion

Networks are an extremely exciting and useful domain of analysis, and one that is increasingly garnering interest from a wide variety of fields. In particular, the possibility of performing network analysis at scale, with datasets of billions of nodes/edges, is seen by many as one of the next big challenges in prediction and machine learning. I plan to write more about all of that in future stories, but hopefully this article gives a helpful brief introduction to the idea of networks and why they can be so powerful in many domains.

Translated from: https://medium.com/@ben.makansi/the-power-of-network-analysis-8a245633a36
