Investing in a New Eatery in California: A Data-Driven Approach

“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Everything is better interpreted through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier attitude toward data, we have far more data available than in the past, and we have at our disposal computing power that was previously unimaginable.

In this situation, the computing power and the data should be leveraged to make better decisions to solve business problems.

For my project, I chose to provide recommendations for opening new eateries in California. The result is a concrete list of investment recommendations: eatery types (such as Japanese restaurant, dessert shop, etc.) together with the counties in which to open them.

In this post, I will go over the full process of a Data Science project.

Data Sources

To solve this problem, data from four sources were used:

  1. The “California Counties” dataset from the California Open Data Portal, provided by the Government of California, for geographical location data.

  2. The Foursquare API, for information about established restaurants and related venue details.

  3. County-wise population data from the US Government Census site.

  4. County-wise Real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.

Exploratory Data Analysis

After cleaning the data (which often feels like more than 90% of a data scientist’s job), meaningful insights were drawn from it.

Figure: City Centers of California’s Counties, source: Author

It was also found that county GDP is strongly correlated with county population, which makes counties with high GDP and high population attractive destinations for investment.
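
The strength of such a relationship is usually summarized with the Pearson correlation coefficient. The snippet below is only a minimal sketch of that check: the population and GDP figures are made up for illustration and are not the values used in the project.

```python
import numpy as np

# Hypothetical county figures, invented for illustration only
# (population in millions, real GDP in billions of dollars).
population = np.array([0.02, 0.10, 0.90, 1.50, 3.20, 10.00])
gdp = np.array([0.8, 4.0, 45.0, 80.0, 210.0, 700.0])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the Pearson correlation between the two series.
r = np.corrcoef(population, gdp)[0, 1]
print(f"Pearson r between population and GDP: {r:.3f}")
```

A value of r close to 1 indicates that GDP rises almost linearly with population.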

Figure: Strong Correlation Between GDP and Population of Californian Counties, source: Author

Figure: Number of Eateries in Each County (capped at 50 by Foursquare), source: Author

Using the information provided by the Foursquare API, a list of the ten most common venue types was obtained for each county. This list is used later when making the recommendations.
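
One way to build such a ranking is to count venue categories per county and keep the most frequent ones. The sketch below assumes the API results have already been flattened into (county, category) pairs; the sample rows and the helper name `most_common_venues` are hypothetical and not taken from the original notebook.

```python
from collections import Counter, defaultdict

# Hypothetical (county, venue category) pairs standing in for API results.
venues = [
    ("Orange", "Mexican Restaurant"),
    ("Orange", "Coffee Shop"),
    ("Orange", "Mexican Restaurant"),
    ("Orange", "Pizza Place"),
    ("Sierra", "Bakery"),
    ("Sierra", "Bakery"),
    ("Sierra", "American Restaurant"),
]

def most_common_venues(rows, top_n=10):
    """Return {county: [category, ...]} ordered from most to least frequent."""
    counts = defaultdict(Counter)
    for county, category in rows:
        counts[county][category] += 1
    return {
        county: [category for category, _ in counter.most_common(top_n)]
        for county, counter in counts.items()
    }

top_venues = most_common_venues(venues, top_n=10)
print(top_venues["Orange"])
```

Counties with fewer than `top_n` distinct categories simply get a shorter list.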

Figure: Five Rows

Applying Machine Learning Model

Choosing Algorithm

The business problem is to find eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is neither to predict a value nor to assign a class, and it is not to give a single investment recommendation. The goal is to offer the stakeholders a list of promising venues.

This can be achieved by clustering the counties by GDP and population. KMeans clustering is a simple statistical learning algorithm that is well suited to this task.

The scikit-learn implementation of the KMeans clustering algorithm was used.

Choosing k

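To choose the best k for clustering, the elbow method was employed: the model is fitted for a range of k values, the inertia is plotted against k, and the k at which the curve bends is chosen.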
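
A minimal sketch of the elbow computation with scikit-learn follows. The data is synthetic, and standardizing the two columns before clustering is an assumption made here for illustration; the original post does not say whether the features were scaled.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the county table: 40 made-up (population, GDP) pairs.
rng = np.random.default_rng(0)
population = rng.lognormal(mean=0.0, sigma=1.0, size=40)
gdp = 60.0 * population * rng.uniform(0.8, 1.2, size=40)
X = np.column_stack([population, gdp])

# Assumption (not from the post): standardize so that neither column
# dominates the Euclidean distances used by KMeans.
X_scaled = StandardScaler().fit_transform(X)

# Fit KMeans for each candidate k and record the inertia, i.e. the sum of
# squared distances from each sample to its nearest cluster center.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_scaled)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(f"k={k}: inertia={inertia:.2f}")
```

Plotting `inertias` against k gives the curve whose bend is read off as the elbow.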

Figure: Inertia vs. Values of k Plot, source: Author

As the plot shows, the elbow is at k = 4, so the clustering algorithm was applied with k = 4, forming four clusters of counties based on population and GDP.
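
Fitting the final model with k = 4 could look like the sketch below. The county names and figures are invented so that the four groups are easy to see, and the scaling step is again an assumption rather than something stated in the post.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Made-up counties (population in millions, GDP in billions of dollars).
names = ["S1", "S2", "S3", "M1", "M2", "M3", "L1", "L2", "L3", "Huge"]
X = np.array([
    [0.02, 1.0], [0.05, 2.0], [0.10, 4.0],      # small counties
    [1.8, 90.0], [2.0, 100.0], [2.2, 110.0],    # mid-range counties
    [3.0, 260.0], [3.2, 280.0], [3.4, 300.0],   # large counties
    [10.0, 700.0],                              # one very large county
])

# Assumption (not from the post): standardize both columns first.
X_scaled = StandardScaler().fit_transform(X)

# Cluster into four groups and collect the member names per cluster label.
labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X_scaled)
clusters = {}
for name, label in zip(names, labels):
    clusters.setdefault(int(label), []).append(name)

for label, members in sorted(clusters.items()):
    print(label, members)
```

The numeric labels that KMeans assigns are arbitrary, so a cluster number only acquires a meaning once its members have been inspected.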

Results

The counties were grouped into four clusters. On examination, Los Angeles County formed a cluster of its own (cluster 2) because of its exceptionally high GDP and population. The counties in another cluster (cluster 3), namely Orange, Santa Clara, and San Diego, also have high GDP and high population, though nowhere near those of Los Angeles County. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, fall into cluster 1, and counties with mid-range GDP and population, such as Sacramento and Riverside, fall into cluster 4.

Figure: Resulting Clusters on a Map, source: Author

Clusters 2 and 3 contain counties with high population and high GDP. In these counties, investing in any type of eatery is likely to be profitable, but it is advisable to choose an eatery type that is not among the three most common venues.

In cluster 4, the population and GDP of the counties are higher than in cluster 1 but lower than in clusters 2 and 3. As investment destinations, these counties come after the counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.

Cluster 1 is dominated by counties with lower populations. These counties should be considered only after the counties in clusters 2, 3, and 4, which makes them the least advisable destinations. Investing in the most common eatery types there is not advised at all.

After these investment options were suggested, a table was built for each cluster listing, for every county, the eatery types that are not among its three most common.
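
The filtering step behind such a table can be sketched as follows, assuming each county already has its venue types ranked from most to least common. The county names, the venue lists, and the helper `candidate_eateries` are hypothetical.

```python
# Hypothetical ranked venue lists (most common first), one per county.
ranked_venues = {
    "County A": ["Mexican Restaurant", "Coffee Shop", "Pizza Place",
                 "Sushi Restaurant", "Dessert Shop", "Bakery"],
    "County B": ["Coffee Shop", "Burger Joint", "Mexican Restaurant",
                 "Thai Restaurant", "Ice Cream Shop"],
}

def candidate_eateries(ranked, skip_top=3):
    """Return the venue types ranked below the `skip_top` most common ones."""
    return ranked[skip_top:]

recommendations = {
    county: candidate_eateries(ranked)
    for county, ranked in ranked_venues.items()
}

for county, options in recommendations.items():
    print(county, "->", options)
```

Raising `skip_top` makes the recommendations more conservative by excluding more of the already crowded eatery types.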

Figure: Table for Counties and Investment Recommendations in Cluster 3

Full Report Link: PDF in GitHub Repository

Notebook with Full Code: NB Viewer

Feel free to comment, provide feedback, or criticize.

Connect with me on LinkedIn or Twitter.

This blog post is related to the Applied Data Science Capstone project offered by IBM through Coursera.

Translated from: https://medium.com/beginning-data-science/investing-in-a-new-eastery-in-california-a-data-driven-approach-e91229e0289e
