Investing in a New Eatery in California: A Data-Driven Approach

“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Everything is better interpreted through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier attitude toward data, we have far more data available to us than ever before, and we have at our disposal computing power that was previously unimaginable.

In this situation, the computing power and the data should be leveraged to make better decisions to solve business problems.

In my project, I chose to provide recommendations for opening new eateries in California. The result is a concrete list of investment recommendations: eatery types (such as Japanese restaurant or dessert shop) paired with the counties in which to open them.

In this post, I will go over the full process of a Data Science project.

Data Sources

To solve this problem, data from four sources were used:

  1. The “California Counties” dataset from the California Open Data Portal, published by the Government of California, for geographical location data.

  2. The Foursquare API, for information about established restaurants and related venue details.

  3. County-wise population data from the US Government Census site.

  4. County-wise real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.

Exploratory Data Analysis

After cleaning the data (which is definitely more than 90% of a Data Scientist’s job), meaningful insights were gained from the data.

[Figure: City Centers of California’s Counties. Source: Author]

It was also found that the GDPs of the counties are strongly correlated with their populations. This makes counties with high GDP and high population attractive destinations for investment.
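A correlation like this can be checked with the Pearson coefficient. The sketch below is a minimal illustration, not the project's actual code: the function name `pearson_r` and the five data points are hypothetical placeholders, not the census or BEA figures.

```python
from math import sqrt

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance and variances (unnormalized; the 1/n factors cancel out).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)

# Hypothetical values: population in millions, real GDP in billions of dollars.
population = [0.5, 1.0, 2.0, 3.0, 10.0]
real_gdp = [20.0, 45.0, 90.0, 160.0, 700.0]

r = pearson_r(population, real_gdp)
print(f"Pearson r = {r:.3f}")
```

A value of r close to 1 indicates that GDP rises almost linearly with population, which is the pattern described above.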

[Figure: Strong Correlation Between GDP and Population of Californian Counties. Source: Author]
[Figure: Number of Eateries in Each County (capped at 50 by Foursquare). Source: Author]

With the information provided by the Foursquare API, a list of the ten most common venue types was obtained for each county. This list is used later when making the recommendations.
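The ranking step can be sketched as a simple frequency count per county. This is an illustration only: the function name `most_common_venues`, the `(county, category)` input shape, and the sample rows are assumptions, not the actual Foursquare response format or the project's code.

```python
from collections import Counter, defaultdict

def most_common_venues(venues, top_n=10):
    """Map each county to its top_n venue categories, most frequent first."""
    counts = defaultdict(Counter)
    for county, category in venues:
        counts[county][category] += 1
    return {
        county: [category for category, _ in counter.most_common(top_n)]
        for county, counter in counts.items()
    }

# Hypothetical (county, venue category) pairs, one per venue returned.
venues = [
    ("Orange", "Mexican Restaurant"),
    ("Orange", "Mexican Restaurant"),
    ("Orange", "Mexican Restaurant"),
    ("Orange", "Pizza Place"),
    ("Orange", "Pizza Place"),
    ("Orange", "Coffee Shop"),
    ("Sierra", "American Restaurant"),
    ("Sierra", "American Restaurant"),
    ("Sierra", "Diner"),
]

# top_n=2 keeps the demo output short; the project used the top ten.
top_venues = most_common_venues(venues, top_n=2)
print(top_venues)
```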

[Figure: Five rows of the resulting table]

Applying a Machine Learning Model

Choosing the Algorithm

The business problem is to look for eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is not to predict a value or a class, nor to hand the stakeholders a single investment recommendation. The goal is to suggest a list of likely venues.

This can be achieved by clustering the counties based on GDP and population. KMeans clustering was chosen for this because it is a simple, widely used algorithm for grouping observations described by a few numeric features.

Scikit-learn's implementation of the KMeans clustering algorithm was used.
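A minimal sketch of clustering counties on population and GDP with scikit-learn's `KMeans` is shown below. The six data rows are hypothetical, and standardizing the features with `StandardScaler` is my assumption here; the post does not say whether the features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical counties: columns are population and real GDP in dollars.
features = np.array([
    [20_000, 1.0e9],
    [30_000, 1.5e9],
    [1_500_000, 80e9],
    [1_600_000, 90e9],
    [9_000_000, 700e9],
    [9_500_000, 720e9],
])

# Assumption: put both features on a comparable scale before clustering,
# so that GDP (a much larger number) does not dominate the distances.
scaled = StandardScaler().fit_transform(features)

# n_clusters=3 suits this toy data only; random_state makes the run repeatable.
model = KMeans(n_clusters=3, n_init=10, random_state=0)
labels = model.fit_predict(scaled)
print(labels)
```

Each entry of `labels` is the cluster index assigned to the corresponding row. The index values themselves are arbitrary; only the grouping matters.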

Choosing k

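To choose the best k for clustering, the elbow method was employed: inertia (the within-cluster sum of squared distances) is plotted against k, and the value of k at which the curve stops dropping sharply is chosen.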
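The elbow method can be sketched by fitting KMeans for a range of k and recording `inertia_` for each fit. The data below is synthetic (four well-separated blobs generated with NumPy), chosen only so that the bend in the curve is easy to see; it is not the county data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data: four tight blobs of 15 points each (an assumption for the demo).
rng = np.random.default_rng(0)
centers = [[0, 0], [10, 0], [0, 10], [10, 10]]
points = np.vstack([rng.normal(c, 0.5, size=(15, 2)) for c in centers])

# Fit one model per candidate k and record the within-cluster sum of squares.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(points)
    inertias[k] = model.inertia_

for k, value in inertias.items():
    print(f"k={k}: inertia={value:.1f}")
```

Plotting `inertias` against k (for example with matplotlib) gives the kind of curve shown in the figure. The elbow is the k after which further increases reduce inertia only slightly.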

[Figure: Inertia vs. Values of k Plot. Source: Author]

As evident from the graph, the best k is 4. Hence, the clustering algorithm was applied with k = 4, forming 4 clusters of counties based on their population and GDP.

Results

Four clusters of counties were formed. Upon examination, it was found that Los Angeles county formed a cluster by itself (cluster 2) because its GDP and population are far higher than those of any other county. The counties in another cluster (cluster 3) have high GDP and high population, though nowhere close to those of Los Angeles county: Orange, Santa Clara, and San Diego. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, form one cluster (cluster 1), and counties with mid-range GDP and population, such as Sacramento and Riverside, form another (cluster 4).

[Figure: Resulting Clusters on a Map. Source: Author]

In clusters 2 and 3 we have counties with a high population and high GDP. In these counties, investing in any eatery is likely to be profitable, but it is advisable to choose an eatery type that is not among the county's top 3 most common venues.

In cluster 4, the population and GDP of the counties are higher than those of the counties in cluster 1 but lower than those of the counties in clusters 2 and 3. Investment in these counties should come after investment in counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.

Cluster 1 is dominated by lower-population counties. Investment in these counties should come only after investments in counties in clusters 2, 3, and 4, and is the least advised overall. Investment in the most common eatery types here is not advised at all.

After suggesting investment options, a table was formed for each cluster listing, for each county, the eatery types that are not among its three most common types.
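One way to read this filtering step is to take each county's ranked list of common venue types and drop the first three. The sketch below assumes that reading; the function name `uncommon_eatery_types` and the sample rankings are hypothetical, not the project's output.

```python
def uncommon_eatery_types(ranked_venues, skip_top=3):
    """For each county, drop the skip_top most common venue types and keep the rest."""
    return {
        county: ranked[skip_top:]
        for county, ranked in ranked_venues.items()
    }

# Hypothetical rankings, most common venue type first.
ranked_venues = {
    "Orange": ["Mexican Restaurant", "Pizza Place", "Coffee Shop",
               "Sushi Restaurant", "Dessert Shop"],
    "San Diego": ["Mexican Restaurant", "Coffee Shop", "Burger Joint",
                  "Thai Restaurant"],
}

recommendations = uncommon_eatery_types(ranked_venues)
for county, types in recommendations.items():
    print(county, "->", ", ".join(types))
```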

[Figure: Table for Counties and Investment Recommendations in Cluster 3]

Full Report Link: PDF in GitHub Repository
Notebook with Full Code: NB Viewer

Feel free to comment, provide feedback, or criticize.

Connect with me on LinkedIn or Twitter.

This blog post is related to the Applied Data Science Capstone Project offered by IBM through Coursera.

Original article: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e
