Introduction
This blog post summarizes the results of the Capstone Project in the IBM Data Science Specialization on Coursera. In this project, the districts of Frankfurt am Main, Germany, are clustered according to their venue data using the K-means clustering algorithm. The first section describes the business problem. We then look at the data that can be used to solve it and the methodology for finding a solution.
Business Problem
A client is interested in opening a franchise of their Asian restaurant chain in the city of Frankfurt am Main, preferably close to the city center. It will be their first restaurant in the city, and they want us to find out which neighborhood or district would be the best place to open it. Additionally, the results of the clustering algorithm can be used by anyone interested in moving to Frankfurt who wants to know about the cuisines available in the various districts.
Data
The following datasets were used in this project:
- Street Directory of the city of Frankfurt am Main: https://offenedaten.frankfurt.de/dataset/strassenverzeichnis-der-stadt-frankfurt-am-main
- Foursquare API, used to get the most common venues in Frankfurt districts.
- Demographics of Frankfurt am Main neighborhoods: https://offenedaten.frankfurt.de/dataset/stadtteilprofile-bevoelkerung
- Election Atlas 2015, GeoJSON of Frankfurt neighborhoods: https://offenedaten.frankfurt.de/dataset/wahlatlas-2015-geodaten/resource/84dff094-ab75-431f-8c64-39606672f1da
Data Gathering and Cleaning
We analyze the districts of the city of Frankfurt am Main in this project. The datasets are available as CSV files, which can be loaded into a pandas dataframe using the pd.read_csv function.
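As a rough illustration of the loading step, here is a minimal sketch. The sample rows, the column names, and the semicolon delimiter are all assumptions made for the example; the actual files on the open-data portal may be laid out differently.

```python
import io
import pandas as pd

# Hypothetical stand-in for a downloaded CSV file; the real column names,
# delimiter, and encoding on the open-data portal may differ.
sample_csv = io.StringIO(
    "Strasse;Stadtteil;PLZ\n"
    "Adalbertstraße;Bockenheim;60486\n"
    "Mainzer Landstraße;Gallus;60326\n"
)

# sep=";" is an assumption about the file layout; dtype keeps postcodes as text.
streets = pd.read_csv(sample_csv, sep=";", dtype={"PLZ": str})
print(streets.shape)
```

Reading postcodes as strings avoids losing leading zeros, which matters for German postcodes in general even if Frankfurt's start with 6.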
Data 1: Street directory of Frankfurt am Main:
This dataset is used to extract the district names and postcodes in Frankfurt. It is available as a CSV file via the link given above. Frankfurt contains 46 city districts. Because it is a street directory, the dataset is large: 4540 rows and 15 columns. It was therefore cleaned and shortened to keep only the district names and postcodes. The resulting dataset contains 46 rows (one for each district) and 3 columns.
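The reduction from street-level rows to one row per district could look roughly like the sketch below. The column names and sample rows are hypothetical, and keeping the first postcode per district is an assumption rather than the author's confirmed approach.

```python
import pandas as pd

# Hypothetical street-level rows; the real file has 4540 rows and 15 columns.
streets = pd.DataFrame({
    "Street": ["Adalbertstraße", "Leipziger Straße", "Mainzer Landstraße", "Frankenallee"],
    "District": ["Bockenheim", "Bockenheim", "Gallus", "Gallus"],
    "Postcode": ["60486", "60487", "60326", "60326"],
})

# Keep only the needed columns, then collapse to one row per district.
# Taking the first postcode per district is an assumption; a district
# can span several postcodes.
districts = (
    streets[["District", "Postcode"]]
    .drop_duplicates()
    .groupby("District", as_index=False)
    .first()
)
print(districts)
```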
Data 2: Foursquare venue data:
The geographical coordinates of the districts are used as input for the Foursquare API, which we use to explore the districts of Frankfurt and get the most common venues for each one. Foursquare returns a JSON file, from which the required data needs to be extracted. We extract only the name, category, and geographical coordinates of each venue. These are then stored in a separate dataframe for use in clustering.
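The extraction step can be sketched as follows. The nested layout (response, groups, items, venue) follows the shape of Foursquare's legacy v2 "venues/explore" responses as far as I recall it; the sample payload, the function name extract_venues, and the output column names are hypothetical.

```python
import pandas as pd

# Hypothetical, heavily trimmed response in the shape the legacy
# "venues/explore" endpoint returned; real responses carry many more fields.
sample_response = {
    "response": {
        "groups": [{
            "items": [
                {"venue": {"name": "Example Noodle Bar",
                           "location": {"lat": 50.12, "lng": 8.64},
                           "categories": [{"name": "Asian Restaurant"}]}},
                {"venue": {"name": "Example Hotel",
                           "location": {"lat": 50.10, "lng": 8.65},
                           "categories": []}},
            ]
        }]
    }
}

def extract_venues(district, payload):
    """Flatten one district's response into rows of name, category, lat, lng."""
    rows = []
    for group in payload.get("response", {}).get("groups", []):
        for item in group.get("items", []):
            venue = item["venue"]
            categories = venue.get("categories") or []
            rows.append({
                "District": district,
                "Venue": venue["name"],
                # Some venues may carry no category at all
                "Venue Category": categories[0]["name"] if categories else None,
                "Venue Latitude": venue["location"]["lat"],
                "Venue Longitude": venue["location"]["lng"],
            })
    return pd.DataFrame(rows)

venues = extract_venues("Bockenheim", sample_response)
print(venues)
```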
Data 3: Frankfurt Demographics:
This dataset contains the district-wise distribution of population for the city of Frankfurt. It also contains useful data about the percentage of foreigners and, specifically, the population of various ethnicities in the districts. It contains 46 rows (one for each district) and 164 columns, so it needs to be shortened before analysis. Only the required columns were picked from this dataset, such as the total population of each district and the population of foreigners. Moreover, the column names are in German; these were translated into English for easy understanding.
Data 4: Frankfurt neighborhoods GeoJSON:
The GeoJSON file is required for plotting the Choropleth maps used to analyze the demographics of Frankfurt districts. The district names in this file must match the district names in the dataset to be plotted. After checking, it was found that the districts of Bahnhofsviertel and Gutleutviertel are combined into a single district in the GeoJSON file. Thus, the 2 district rows were merged in the demographics dataset. There was also an issue with German letters containing umlauts, i.e. ü, ä, ö. Hence, districts containing these letters were renamed to match the spelling of their names in the GeoJSON file.
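One possible way to do the renaming and merging is sketched below. The population figures are made up, and both the combined district name and the umlaut-free spelling are placeholders: the actual spellings have to be read from the GeoJSON file itself.

```python
import pandas as pd

# Hypothetical figures; only the district names come from the text.
demographics = pd.DataFrame({
    "District": ["Bahnhofsviertel", "Gutleutviertel", "Rödelheim"],
    "Population": [100, 200, 300],
})

# Both mappings are assumptions: the actual spellings must be read from
# the district-name properties inside the GeoJSON file.
rename_map = {
    "Bahnhofsviertel": "Gutleut-/Bahnhofsviertel",
    "Gutleutviertel": "Gutleut-/Bahnhofsviertel",
    "Rödelheim": "Roedelheim",
}

demographics["District"] = demographics["District"].replace(rename_map)

# Rows that now share a name are merged by summing their counts.
merged = demographics.groupby("District", as_index=False, sort=False)["Population"].sum()
print(merged)
```

Summing is only appropriate for count columns; percentage columns would need to be recomputed from the merged counts.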
Methodology
Analytical Approach
We first use k-means clustering to cluster the neighborhoods in Frankfurt. Frankfurt has 46 districts. We use a geocoder to get the geographical coordinates of each district, then use the Foursquare API to explore the districts by their coordinates and get the most common venues in each. Based on this information, we cluster the districts using k-means and take a look at each cluster. We look for clusters with a greater number of Asian and similar cuisine restaurants, as that indicates demand for Asian cuisine in that cluster.
Then we use the demographics data to find the districts with a greater population and compare that with the cluster data. We look for districts that have more Asian restaurants as well as a sizeable Asian population, as these will be ideal for opening a new Asian restaurant. Additionally, we look at nearby districts with fewer Asian restaurants but a sizeable Asian population, which are also a good prospect because of lower competition in the area.
The street directory dataset is filtered and sliced to obtain just a list of the districts in Frankfurt am Main along with their postal codes.
We require the geographical coordinates of the districts to plot them on a map using Folium. These are not readily available in the dataset. We obtain the latitude and longitude for each district using geopy, a Python client for several popular geocoding web services.
Geopy makes it easy for Python developers to locate the coordinates of addresses, cities, countries, and landmarks across the globe using third-party geocoders and other data sources to get the data.
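A geocoding pass over the districts might look like the sketch below. The post does not say which geocoding service was used, so Nominatim is an assumption here, as are the query format, the user_agent string, and the helper names make_nominatim_geocoder and add_coordinates.

```python
import pandas as pd

def make_nominatim_geocoder(user_agent="frankfurt-districts-example"):
    """Return a rate-limited geocode function backed by Nominatim.

    geopy is imported lazily so the rest of the sketch runs without it.
    The user_agent string is a placeholder.
    """
    from geopy.geocoders import Nominatim
    from geopy.extra.rate_limiter import RateLimiter

    geolocator = Nominatim(user_agent=user_agent)
    return RateLimiter(geolocator.geocode, min_delay_seconds=1)

def add_coordinates(districts, geocode):
    """Look up each district and attach Latitude and Longitude columns."""
    lats, lons = [], []
    for name in districts["District"]:
        # The query format is an assumption chosen to disambiguate district names
        location = geocode(f"{name}, Frankfurt am Main, Germany")
        # geocode returns None when nothing is found
        lats.append(location.latitude if location else None)
        lons.append(location.longitude if location else None)
    return districts.assign(Latitude=lats, Longitude=lons)

districts = pd.DataFrame({"District": ["Bockenheim", "Gallus"]})
# Real usage (needs network access):
# districts = add_coordinates(districts, make_nominatim_geocoder())
```

Passing the geocode function in as an argument keeps the lookup logic testable offline and makes it easy to swap in a different geocoding service.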
Next, the top 100 venues are fetched for each postal code. For this task, an API call to the Foursquare API is performed. The Foursquare API offers location data from all over the world for business purposes as well as for developers; a developer only needs a free developer account. The required format of the URL for performing an API call to the Foursquare API is displayed below.
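For orientation, a request URL in the style of Foursquare's legacy v2 "venues/explore" endpoint can be assembled as in this sketch. The version date, the radius, the placeholder credentials, and the function name build_explore_url are assumptions, and the current Foursquare API may differ from this legacy form, so check the official documentation before relying on it.

```python
from urllib.parse import urlencode

def build_explore_url(client_id, client_secret, lat, lng,
                      version="20180605", radius=500, limit=100):
    """Build a URL in the style of Foursquare's legacy v2 'venues/explore' call."""
    params = {
        "client_id": client_id,
        "client_secret": client_secret,
        "v": version,          # API version date, YYYYMMDD (assumed value)
        "ll": f"{lat},{lng}",  # latitude,longitude of the district
        "radius": radius,      # search radius in metres (assumed value)
        "limit": limit,        # the post fetches the top 100 venues
    }
    return "https://api.foursquare.com/v2/venues/explore?" + urlencode(params)

url = build_explore_url("YOUR_CLIENT_ID", "YOUR_CLIENT_SECRET", 50.1109, 8.6821)
print(url)
```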
The received venues are stored in a new dataframe. We check for the number of unique venue categories present in the data returned by Foursquare. It turns out there are 188 unique venue categories in Frankfurt.
Next, we need to prepare the data for the K-means clustering algorithm. It cannot work with textual data, more commonly known as categorical data. Hence we need to encode the data using one-hot encoding. The encoded data is then grouped by district name in order to have 1 row for each district. Grouping aggregates the one-hot columns, so a venue category that appears more than once within a district is counted each time. In order to have values on the same scale and no greater than one, the mean frequency of occurrence of each category is calculated and stored.
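The encoding and grouping described above can be sketched with pandas as follows. The sample venues and column names are hypothetical; the technique shown is one-hot encoding followed by a per-district mean.

```python
import pandas as pd

# Hypothetical venues; the real dataframe holds up to 100 venues per district.
venues = pd.DataFrame({
    "District": ["Bockenheim", "Bockenheim", "Bockenheim", "Gallus", "Gallus"],
    "Venue Category": ["Asian Restaurant", "Café", "Asian Restaurant", "Hotel", "Café"],
})

# One column per category, 1.0 where the venue belongs to it.
onehot = pd.get_dummies(venues["Venue Category"], dtype=float)
onehot.insert(0, "District", venues["District"])

# Averaging the indicators per district gives each category's share of venues.
grouped = onehot.groupby("District").mean().reset_index()
print(grouped)
```

Because each venue has exactly one category, the values in every row sum to one, which puts all districts on the same scale regardless of how many venues were returned.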
In order to get more insights into the data, the top 10 most common venues for each district are obtained and a separate dataframe is created to store these.
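Ranking the categories per district could be done along these lines. The frequencies, the output column names, and the helper most_common_venues are made up for the illustration, and top_n is set to 2 only because the sample has three categories.

```python
import pandas as pd

# Hypothetical per-district category frequencies.
grouped = pd.DataFrame({
    "District": ["Bockenheim", "Gallus"],
    "Asian Restaurant": [0.50, 0.10],
    "Café": [0.30, 0.20],
    "Hotel": [0.20, 0.70],
})

def most_common_venues(grouped, top_n=10):
    """Return one row per district with its top_n categories, most frequent first."""
    rows = []
    for _, row in grouped.iterrows():
        ranked = row.drop("District").astype(float).sort_values(ascending=False)
        top = ranked.index[:top_n].tolist()
        rows.append([row["District"]] + top)
    # Fewer than top_n categories may exist, so size the columns from the data
    width = max(len(r) for r in rows) - 1
    columns = ["District"] + [f"Most Common Venue {i + 1}" for i in range(width)]
    return pd.DataFrame(rows, columns=columns)

top_venues = most_common_venues(grouped, top_n=2)
print(top_venues)
```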
Clustering using K-means
The one-hot encoded and grouped data is the input to the K-means algorithm, and the number of clusters is set to five. We use the scikit-learn library for the K-means algorithm. The district column is dropped because it is textual data and we need to cluster using only the encoded values. The resulting cluster labels are then added to the dataframe containing the ten most common venues for each district.
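A minimal sketch of the clustering step with scikit-learn is shown below. The six placeholder districts and their frequencies are invented, so three clusters are used instead of the five mentioned in the text; random_state and n_init are set only for repeatability and are not taken from the post.

```python
import pandas as pd
from sklearn.cluster import KMeans

# Hypothetical category frequencies for six districts (names are placeholders).
grouped = pd.DataFrame({
    "District": ["A", "B", "C", "D", "E", "F"],
    "Asian Restaurant": [0.6, 0.5, 0.1, 0.0, 0.1, 0.0],
    "Hotel": [0.1, 0.1, 0.7, 0.8, 0.1, 0.2],
    "Park": [0.3, 0.4, 0.2, 0.2, 0.8, 0.8],
})

# The post uses five clusters for 46 districts; three suit this tiny sample.
k = 3
features = grouped.drop(columns="District")

# random_state and n_init are set only to make the run repeatable.
kmeans = KMeans(n_clusters=k, n_init=10, random_state=0).fit(features)

# Store the labels next to the district names for later merging.
clustered = grouped[["District"]].copy()
clustered.insert(0, "Cluster Label", kmeans.labels_)
print(clustered)
```

The numeric label assigned to each cluster is arbitrary; only the grouping of districts is meaningful.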
The dataframe containing the cluster labels and top venues is then merged with the dataframe containing latitude and longitude, as seen in the image above. This data is then used to visualize the clusters on a map using Folium.
We then look at each cluster, name it based on its most common venues, and decide which cluster is suitable for opening a new Asian restaurant.
Observations
We observe that the purple and light green clusters contain the most districts and the largest number of venues. While the light green cluster contains more restaurants, the purple cluster contains more hotels, which indicates tourists. A variety of cuisines are offered in the light green cluster, indicating that its districts cater to a variety of customers. Most of these districts are located close to the city center. These factors make this cluster the most suitable for opening a new Asian restaurant.
The purple cluster, on the other hand, does not contain many restaurants but has a lot of hotels and is fairly close to the city center. The presence of hotels indicates an influx of tourists, some of them Asian, meaning more prospective customers. With a location not too far from the city center, an Asian restaurant here could flourish.
To find out which district specifically would be best for opening an Asian restaurant, we look at the district-wise demographics of Frankfurt am Main, and then explore districts from both the light green and purple clusters.
Data Exploration: Frankfurt demographics
The demographics dataset contains the district-wise distribution of population for the city of Frankfurt. It also contains useful data about the percentage of foreigners and, specifically, the population of various ethnicities in the districts. Only the required columns were picked from this dataset, such as the total population of each district and the population of foreigners. This dataset was then merged with the dataset containing the latitudes and longitudes of the districts. The resulting dataset is as seen below.
Data visualization using Choropleth maps
The data from the demographics dataset is then plotted on a Choropleth map to visualize the population distribution across the city of Frankfurt. This data will then be used to select districts based on the earlier clustering results to explore further.
From this map, we observe that the central districts have the highest populations in Frankfurt, along with the district of Flughafen on the outskirts.
Next, we take a look at the distribution of Asian and Australian population in Frankfurt.
We can see from the above maps that the districts of Bockenheim and Gallus have the highest populations of Asians and Australians. Of these, Bockenheim belongs to the light green cluster and Gallus to the purple cluster. These two districts, along with nearby Niederrad, are then explored to find the number of Asian or similar cuisine restaurants in each.
1. Bockenheim
2. Gallus
3. Niederrad
Results and Discussion
By clustering the districts in Frankfurt, analyzing the district-wise demographics of the city, and then combining the two findings, we arrived at 3 prospective neighborhoods that would be ideal for opening an Asian restaurant in the city.
1. Bockenheim:
Bockenheim falls in the light green cluster and is very close to the city center. It has 7 Asian restaurants, which suggests that there is considerable demand for Asian cuisine in the area. It also has the highest Asian population in the city, at 1586.
2. Gallus:
Gallus is in the purple cluster, which contains a greater number of hotels. It is not far from the city center and has 5 Asian restaurants, indicating that there is demand here as well. It has the second-highest Asian population in the city, at 1512. Hence, this seems like a better option than Bockenheim for opening an Asian restaurant, owing to less competition, a similar Asian population, and more prospective customers in the form of tourists.
3. Niederrad:
Niederrad is also in the purple cluster, which has more hotels. It is also not far from the city center but has only 1 Asian restaurant, far fewer than both Bockenheim and Gallus. Niederrad also has a sizeable Asian population at 929, although somewhat smaller than that of the other 2 districts in contention. Since it is in the purple cluster, we can expect more tourists in this district; there are 3 hotels in the area. This translates to more prospective customers. Hence, this also seems like a good alternative to Gallus, owing to much less competition, proximity to the city center, and more tourists.
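To make the competition argument concrete, the figures quoted for the three districts can be put side by side as below. The ratio of Asian residents per Asian restaurant is a rough proxy introduced here purely for illustration; it is not a metric used in the original analysis, and it ignores tourists and distance to the city center.

```python
# Figures quoted in the text above.
candidates = {
    "Bockenheim": {"asian_population": 1586, "asian_restaurants": 7},
    "Gallus": {"asian_population": 1512, "asian_restaurants": 5},
    "Niederrad": {"asian_population": 929, "asian_restaurants": 1},
}

# Residents per existing restaurant: a rough, hypothetical proxy for how
# crowded the market is. Higher means fewer competitors per potential customer.
ratios = {
    name: d["asian_population"] / d["asian_restaurants"]
    for name, d in candidates.items()
}

for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {ratio:.1f} Asian residents per Asian restaurant")
```

On this proxy alone the ordering is Niederrad, then Gallus, then Bockenheim, which is consistent with the competition argument made for each district above.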
Conclusion
The neighborhoods in Frankfurt am Main were clustered and the results displayed on a map. The demographics were then studied and, based on the findings, 3 districts were found to be ideal as a solution to the business problem of opening an Asian restaurant. The client can choose any of the 3 neighborhoods, based on their preferences, confidence, and appetite for risk.
Translated from: https://medium.com/swlh/clustering-neighborhoods-in-frankfurt-am-main-using-k-means-bb805545fd00