“It is difficult to make predictions, especially about the future.”

~Niels Bohr

Most things are better understood through data, and data-driven decision making is crucial for success in any industry.

This has been true since time immemorial. The difference now is that we have developed a healthier attitude toward data, we have far more data available than ever before, and we have at our disposal computing power that was previously unimaginable.

In this situation, computing power and data should be leveraged to make better decisions and solve business problems.

In my project, I chose to provide recommendations for opening new eateries in California. The result is a concrete list of investment recommendations: eatery types (such as Japanese restaurant, dessert shop, etc.) and the counties in which to open them.

In this post, I will walk through the full process of a data science project.

Data Sources

To solve this problem, data from four sources was used:

Location data titled “California Counties” from the California Open Data Portal, provided by the Government of California, for geographical location data.
The Foursquare API, for information about established restaurants and other relevant details about them.
County-wise population data from the US Government Census site.
County-wise real GDP data provided by the Bureau of Economic Analysis, U.S. Department of Commerce.

Exploratory Data Analysis

After cleaning the data (which often feels like more than 90% of a data scientist’s job), meaningful insights were drawn from it.

Figure: City Centers of California’s Counties (source: author)

It was also found that county GDP is strongly correlated with county population. This makes counties with high GDP and high population attractive destinations for investment.
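A correlation check of this kind can be sketched with pandas. The column names and all figures below are made up for illustration and are not the project's data; the assumption is one row per county with a population column and a real GDP column.

```python
import pandas as pd

# Hypothetical, made-up figures for illustration only (not the project's data).
counties = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D", "E"],
        "population": [10_000, 50_000, 200_000, 1_000_000, 3_000_000],
        "real_gdp": [0.4e9, 2.1e9, 9.5e9, 60.0e9, 210.0e9],
    }
)

# Pearson correlation coefficient between population and real GDP.
corr = counties["population"].corr(counties["real_gdp"])
print(f"Pearson correlation: {corr:.3f}")
```

A coefficient close to 1 indicates that the two features move together, which is why either one alone already says a lot about a county's market size.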

Using the information provided by the Foursquare API, a list of the ten most common venue types was obtained for each county. This list is used later in decision making.
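One way to build such a list is sketched below. It assumes the Foursquare results have already been flattened into a table with one row per venue; the `county` and `category` column names, the `top_venue_categories` helper, and the sample rows are all hypothetical.

```python
import pandas as pd


def top_venue_categories(venues: pd.DataFrame, n: int = 10) -> pd.DataFrame:
    """Return the n most frequent venue categories per county, ranked 1..n."""
    counts = (
        venues.groupby(["county", "category"])
        .size()
        .reset_index(name="count")
        # Most frequent first; ties are broken alphabetically by category.
        .sort_values(["county", "count", "category"], ascending=[True, False, True])
    )
    counts["rank"] = counts.groupby("county").cumcount() + 1
    return counts[counts["rank"] <= n].reset_index(drop=True)


# Hypothetical venue rows for illustration only.
venues = pd.DataFrame(
    {
        "county": ["Alpha"] * 6 + ["Beta"] * 4,
        "category": [
            "Pizza Place", "Pizza Place", "Pizza Place", "Cafe", "Cafe", "Dessert Shop",
            "Taco Place", "Taco Place", "Cafe", "Diner",
        ],
    }
)

top = top_venue_categories(venues, n=2)
print(top)
```

With real data, `n=10` would give the ten most common venue types per county described above.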

Applying a Machine Learning Model

Choosing the Algorithm

The business problem is to find eatery types and locations to invest in. The data is not labeled, which makes this a classic application of unsupervised learning.

The aim is not to predict a value or a class, nor to offer a single investment recommendation. The goal is to give the stakeholders a list of likely venues.

This can be achieved by clustering the counties based on GDP and population. K-means clustering is a simple algorithm that is well suited to this task.

Scikit-learn’s implementation of the KMeans clustering algorithm was used.
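A minimal sketch of clustering counties on these two features with scikit-learn's `KMeans` follows. The data is made up, and standardizing the features first is my assumption; the original project may or may not have scaled them. Two clusters are used here only because the toy data has two obvious groups.

```python
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler

# Hypothetical, made-up figures for illustration only (not the project's data).
counties = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D", "E", "F"],
        "population": [10_000, 20_000, 15_000, 2_000_000, 2_200_000, 1_900_000],
        "real_gdp": [0.5e9, 0.9e9, 0.7e9, 150e9, 170e9, 140e9],
    }
)

# Assumption: put both features on a comparable scale so that neither
# dominates the Euclidean distance used by k-means.
scaled = StandardScaler().fit_transform(counties[["population", "real_gdp"]])

# Fit k-means and attach the cluster label to each county.
model = KMeans(n_clusters=2, n_init=10, random_state=0)
counties["cluster"] = model.fit_predict(scaled)
print(counties)
```

Fixing `random_state` makes the cluster assignment reproducible, although the numeric labels themselves carry no meaning.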

Choosing k

To choose the best k for clustering, the elbow method was employed.
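The elbow method fits the model for a range of k values and looks for the point where the within-cluster sum of squares (exposed by scikit-learn as `inertia_`) stops dropping sharply. The sketch below uses synthetic points drawn around four made-up centers, not the county data.

```python
import numpy as np
from sklearn.cluster import KMeans

# Synthetic data for illustration: 25 points around each of four centers.
rng = np.random.default_rng(0)
centers = np.array([[0, 0], [10, 0], [0, 10], [10, 10]])
X = np.vstack([c + rng.normal(scale=0.5, size=(25, 2)) for c in centers])

# Fit k-means for k = 1..8 and record the inertia for each k.
inertias = {}
for k in range(1, 9):
    model = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias[k] = model.inertia_

for k, inertia in inertias.items():
    print(k, round(inertia, 1))
```

Plotting these inertia values against k gives the elbow curve; for this synthetic data the curve flattens after k = 4, which is where the "elbow" sits.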

As is evident from the elbow plot, the best k is 4. Hence, the clustering algorithm was applied with k = 4, forming 4 clusters of counties based on their population and GDP.

Results

Four clusters of counties were formed. On examination, Los Angeles County formed a cluster by itself (cluster 2) because its GDP and population are far higher than those of any other county. Counties in another cluster (cluster 3) also have high GDP and high population, but nowhere near those of Los Angeles County; this cluster contains three counties: Orange, Santa Clara, and San Diego. Counties with low GDP and low population, such as Plumas, Nevada, and Sierra, fall in one cluster (cluster 1), and counties with mid-range GDP and population, such as Sacramento and Riverside, fall in another (cluster 4).
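This kind of examination can be sketched as a per-cluster summary. The table below is hypothetical: the county names, figures, and cluster labels are placeholders chosen only to show the mechanics, and the `real_gdp_billion` column name is my own.

```python
import pandas as pd

# Hypothetical clustered counties for illustration only (not the project's data).
clustered = pd.DataFrame(
    {
        "county": ["A", "B", "C", "D", "E", "F"],
        "population": [20_000, 30_000, 1_500_000, 2_400_000, 3_200_000, 10_000_000],
        "real_gdp_billion": [1.0, 1.5, 80.0, 100.0, 300.0, 700.0],
        "cluster": [1, 1, 4, 4, 3, 2],
    }
)

# Size and average population/GDP of each cluster, largest economy first.
profile = (
    clustered.groupby("cluster")
    .agg(
        n_counties=("county", "size"),
        mean_population=("population", "mean"),
        mean_real_gdp_billion=("real_gdp_billion", "mean"),
    )
    .sort_values("mean_real_gdp_billion", ascending=False)
)
print(profile)
```

A single-county cluster at the top of such a table is the pattern described above for Los Angeles County.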

In clusters 2 and 3, the counties have high population and high GDP. Investing in any eatery in these counties is likely to be profitable, though it is advisable to choose an eatery type that is not among the top 3 most common venues.

In cluster 4, the population and GDP of the counties are higher than those of the counties in cluster 1 but lower than those of the counties in clusters 2 and 3. Investment in these counties ranks after investment in counties in cluster 2 and cluster 3, in that order. Investment should go to uncommon eatery types so that they face less competition.

Cluster 1 is dominated by low-population counties. Investment in these counties is the least advised and should be considered only after counties in clusters 2, 3, and 4. Investment in the most common eatery types is not advised at all.

After suggesting investment options, a table was built for each cluster listing the eatery types that are not among its three most common types.
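A table like this can be sketched by ranking eatery types within each cluster and dropping the top three. The `uncommon_eatery_types` helper, the column names, and the counts are hypothetical and serve only as an illustration.

```python
import pandas as pd


def uncommon_eatery_types(counts: pd.DataFrame, skip_top: int = 3) -> pd.DataFrame:
    """For each cluster, drop the skip_top most common eatery types and keep the rest."""
    ranked = counts.sort_values(
        ["cluster", "count", "category"], ascending=[True, False, True]
    ).copy()
    ranked["rank"] = ranked.groupby("cluster").cumcount() + 1
    return ranked[ranked["rank"] > skip_top].reset_index(drop=True)


# Hypothetical eatery counts per cluster for illustration only.
counts = pd.DataFrame(
    {
        "cluster": [1, 1, 1, 1, 1, 2, 2, 2, 2],
        "category": [
            "Pizza Place", "Cafe", "Mexican Restaurant", "Japanese Restaurant", "Dessert Shop",
            "Cafe", "Burger Joint", "Pizza Place", "Thai Restaurant",
        ],
        "count": [40, 35, 30, 12, 8, 50, 45, 30, 5],
    }
)

suggestions = uncommon_eatery_types(counts)
print(suggestions[["cluster", "category", "count"]])
```

The remaining rows per cluster are the candidate eatery types that face less competition from already common venues.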