Using GeoPandas for Spatial Visualization
Recently, I was working on a project where I was trying to build a model that could predict housing prices in King County, Washington — the area that surrounds Seattle. After looking at the features, I wanted a way to determine the houses’ worth based on location.
The dataset included latitude and longitude and it was easy to google them to take a look at the houses, their neighborhoods, their distance from the water, etc. But with over 17000 observations, that was a fool’s task. I had to find an easier way.
I had used Geographic Information Systems (GIS) only once before, and not in Python. So I did what I do best: I googled, and ran into this amazing package called GeoPandas. I am going to let the GeoPandas team sum up what they do because they can say it much better than I can.
GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and descartes and matplotlib for plotting. — Description from GeoPandas Website (2020)
This blew my mind, and what I wanted was really just the most basic of the features. I am going to show you how to run this code and do what I did — plotting accurate points on a map.
You are going to need several packages and some files in addition to the basic pandas and matplotlib. They include:
- geopandas: the package that makes all of this possible
- shapely: a package for manipulation and analysis of planar geometric objects
- descartes: provides a nicer integration of Shapely geometry objects with Matplotlib. It's not needed every time but I import it just to be safe
- Any .shp file: this is going to be the backdrop of the plot. Mine shows King County, but you should be able to find one from any city's data department. Don't delete any files from the .zip archive it comes in; something always breaks.
More information about shapefiles can be found here, but the long and short of it is that these aren’t normal images. They are a vector data storage format that has information linking to locations — coordinates and the rest.
First I imported the basic packages that I needed and then the new packages:
import pandas as pd  # needed for pd.read_csv below
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon
import geopandas as gpd
import descartes
The Point and Polygon features are what help me match my data to the map I make.
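To make the role of these two classes concrete, here is a minimal sketch using only shapely. The rectangle and the two house coordinates are made-up values for illustration, not taken from the dataset. Note that shapely expects (x, y) order, which for geographic data means (longitude, latitude).

```python
from shapely.geometry import Point, Polygon

# Hypothetical rectangle roughly around downtown Seattle, in (longitude, latitude) order
district = Polygon([
    (-122.36, 47.59),
    (-122.30, 47.59),
    (-122.30, 47.63),
    (-122.36, 47.63),
])

# Two hypothetical houses: one inside the rectangle, one well outside it
house_in = Point(-122.33, 47.61)
house_out = Point(-121.98, 47.20)

inside = district.contains(house_in)
outside = district.contains(house_out)
print(inside, outside)
```

This is the same relationship the map relies on: each house becomes a Point, each region of the shape file is a Polygon, and both live in the same coordinate space.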
Next, I load in my data. This is basic pandas, but for those who are new: the string in quotes is the name of the file that holds the housing records.
df = pd.read_csv('kc_house_data_train.csv')
With all of the packages imported and the data ready to go, I wanted to take a look at the map I was going to be plotting. I did this by finding a shape file made by the King County government website. They have done all the hard work of surveying and cataloging the land — it would be rude to not use their freely offered services. Loading in the shape file is easy and comparable to loading in a csv file with pandas.
kings_county = gpd.read_file('*file_path_here*/School_Districts_in_King_County___schdst_area.shp')
You can open this up if you want to take a look at the data. The King County shape file was just a dataframe of locations matched with their school districts, geometry coordinates, and area. But the best part is when we plot it and yes, we have to plot it. This isn’t an image you can just call — it will have the coordinates built in so our data can be placed down like a point on a 5th grade (x,y) graph.
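For readers who want to poke at a GeoDataFrame before plotting it, the sketch below builds a tiny stand-in with two made-up square "districts" and inspects it. The toy frame and its names are assumptions for illustration; the same attributes should be available on the frame returned by `gpd.read_file`.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Toy stand-in for a shape file: two adjacent unit squares with hypothetical names
toy = gpd.GeoDataFrame(
    {"NAME": ["District A", "District B"]},
    geometry=[
        Polygon([(0, 0), (1, 0), (1, 1), (0, 1)]),
        Polygon([(1, 0), (2, 0), (2, 1), (1, 1)]),
    ],
    crs="EPSG:4326",
)

geom_types = toy.geom_type.tolist()  # kind of geometry stored in each row
bounds = toy.total_bounds            # minx, miny, maxx, maxy of the whole layer
epsg = toy.crs.to_epsg()             # coordinate reference system as an EPSG code

print(toy.head())
print(geom_types, bounds, epsg)
```

Checking the CRS and the bounds early is a cheap way to confirm that the shape file and the housing coordinates are expressed in the same units before layering them.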
Using the below code (notice how I edited it the same way I would edit a graph):
fig, ax = plt.subplots(figsize = (15,15))
kings_county.plot(ax=ax)
ax.set_title('King County',fontdict = {'fontsize': 30})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
My output looked like this:
Before we start adding our housing data we should look at utilizing the shape file to the fullest. Let’s take a look at the file.
   OID   D#  NAME           geometry
0    1    1  Seattle        MULTIPOLYGON (((-122.40324 47.66637...
1    2  210  Federal Way    POLYGON ((-122.29057 47.39374...
2    3  216  Enumclaw       POLYGON ((-121.84898 47.34708...
3    4  400  Mercer Island  POLYGON ((-122.24475 47.59601...
4    5  401  Highline       POLYGON ((-122.35853 47.51553...
(Truncated for clarity)
As you can see, the county is divided into school districts, each with a shape that serves as its boundary. We will now plot the shape file and annotate the districts using the data provided, like so:
left = ['Riverview','Snoqualmie Valley']
center = ['Skykomish','Kent','Auburn','Tahoma','Vashon Island','Northshore','Shoreline','Renton','Highline','Issaquah','Enumclaw','Seattle','Federal Way','Bellevue','Mercer Island','Lake Washington','Tukwila']
right = ['Fife']
# the loop below reads row['coords'], so build that column first:
# a point guaranteed to lie inside each district polygon, used to anchor its label
kings_county['coords'] = kings_county['geometry'].apply(lambda g: g.representative_point().coords[0])
kings_county.plot(figsize = (15,15),cmap = 'gist_earth')
for idx, row in kings_county.iterrows():
    # newer matplotlib versions name the first argument `text` (it used to be `s`)
    if row['NAME'] in left:
        plt.annotate(text=row['NAME'], xy=row['coords'], ha='left', color = 'red')
    elif row['NAME'] in center:
        plt.annotate(text=row['NAME'], xy=row['coords'], ha='center', color = 'red')
    elif row['NAME'] in right:
        plt.annotate(text=row['NAME'], xy=row['coords'], ha='right', color = 'red')
plt.title('School Districts in Kings County, WA', fontdict = {'fontsize': 20})
plt.ylabel('Latitude',fontdict = {'fontsize': 20})
plt.xlabel('Longitude',fontdict = {'fontsize': 20})
The lists — left, right, center — are from trial and error with the placement of the district names. Some overlapped or needed to be manipulated so that they did not stray too far from their actual district.
I've changed the color map to gist_earth for clarity. Next, I iterated through each row, using the entry in the NAME series as the label and placing it at a point that was definitely inside the polygon. I aligned the names based on the lists I had made earlier. And this was our output:
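Why insist on a point that is definitely inside the polygon? A small sketch with shapely may help. It assumes a made-up U-shaped polygon (not one of the real districts) and compares its centroid with the result of `representative_point()`, which shapely documents as a cheaply computed point guaranteed to be within the geometry.

```python
from shapely.geometry import Polygon

# Hypothetical U-shaped region: a bottom strip with two vertical legs and a notch between them
u_shape = Polygon([
    (0, 0), (3, 0), (3, 3), (2, 3),
    (2, 1), (1, 1), (1, 3), (0, 3),
])

centroid = u_shape.centroid                   # center of mass, can fall in the notch
label_point = u_shape.representative_point()  # always lies on the shape itself

centroid_inside = u_shape.contains(centroid)
label_inside = u_shape.covers(label_point)
print(centroid_inside, label_inside)
```

For oddly shaped districts (coastlines, islands, concave borders) a centroid-based label can land in a neighbor or in the water, which is why the representative point is the safer anchor for annotations.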
Each of the regions signifies a school district in King County. This matches the data I found about the twenty school districts in the county. I never really thought about the size and shape of a county, so I googled it just to be sure.
It seemed like the Google Maps image was the perfect hole for my puzzle piece. From here, it was just a matter of formatting my data to fit the shape file. I did that by initiating my coordinate system and creating applicable points using the latitude and longitude of my houses.
crs = 'EPSG:4326' # initiating my coordinate system (WGS 84 lat/long); the older {'init': 'epsg:4326'} form is deprecated
geometry = [Point(x,y) for x,y in zip(df.long,df.lat)] # creating points (longitude is x, latitude is y)
If you were to look at an entry in geometry, all you would get back is that it is a shapely object. The points still need to be attached to our original dataframe. Below, I make a brand new dataframe that combines the old dataframe, the coordinate system, and the points built from the latitude and longitude of each house.
geo_df = gpd.GeoDataFrame(df,                   # the dataframe
                          crs = crs,            # coordinate system
                          geometry = geometry)  # geometric points
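As an alternative sketch (not the approach used above), GeoPandas also offers `gpd.points_from_xy`, which builds the point geometry directly from two columns. The three-row `houses` frame below is invented for illustration; only the column names `price`, `lat`, and `long` mirror the ones used in this post.

```python
import pandas as pd
import geopandas as gpd

# Hypothetical stand-in for the housing data
houses = pd.DataFrame({
    "price": [450000, 820000, 610000],
    "lat": [47.61, 47.57, 47.35],
    "long": [-122.33, -122.22, -122.05],
})

# Longitude goes in as x and latitude as y
geo_houses = gpd.GeoDataFrame(
    houses,
    geometry=gpd.points_from_xy(houses["long"], houses["lat"]),
    crs="EPSG:4326",
)

first = geo_houses.geometry.iloc[0]
print(geo_houses.head())
```

Both routes end in the same kind of object; the helper simply avoids the explicit list comprehension.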
That was the last step before we can plot the houses. Now, we put it all together.
fig, ax = plt.subplots(figsize = (15,16))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df.plot(ax = ax , markersize = 2, color = 'blue',marker ='o',label = 'House', aspect = 1)
plt.legend(prop = {'size':10} )
ax.set_title('Houses in Kings County, WA', fontdict = {'fontsize':20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
In the code above, the steps include:
- Calling an object to plot.
- Plotting the King County shape file.
- Plotting the data I made that includes the geometry points. This includes making markers, choosing the aspect, and adding the label for the legend.
- Adding a legend, title, and axis labels.
These steps were done for each of the graphs.
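Since the same steps repeat for every map, they could be wrapped in a small helper. The function name `plot_layers`, its parameters, and the toy county and houses below are all hypothetical, introduced only to sketch the idea; the plotting calls themselves are the ones already used in this post.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import geopandas as gpd
from shapely.geometry import Point, Polygon


def plot_layers(base, layers, title):
    """Draw a base map, then each (GeoDataFrame, style dict) layer on top of it."""
    fig, ax = plt.subplots(figsize=(8, 8))
    base.plot(ax=ax, alpha=0.8, color="black")
    for gdf, style in layers:
        gdf.plot(ax=ax, **style)
    ax.legend(prop={"size": 10})
    ax.set_title(title, fontdict={"fontsize": 20})
    ax.set_ylabel("Latitude", fontdict={"fontsize": 20})
    ax.set_xlabel("Longitude", fontdict={"fontsize": 20})
    return fig, ax


# Toy data standing in for the county outline and the houses
toy_county = gpd.GeoDataFrame(
    geometry=[Polygon([(-122.5, 47.2), (-121.5, 47.2), (-121.5, 47.8), (-122.5, 47.8)])],
    crs="EPSG:4326",
)
toy_houses = gpd.GeoDataFrame(
    {"price": [500000, 900000]},
    geometry=[Point(-122.3, 47.6), Point(-122.2, 47.5)],
    crs="EPSG:4326",
)

house_style = {"markersize": 20, "color": "blue", "marker": "o", "label": "House"}
fig, ax = plot_layers(toy_county, [(toy_houses, house_style)], "Houses in King County, WA")
```

Passing the layers as a list keeps the drawing order explicit, which matters once one set of points is stacked on top of another.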
Our output:
This is a great product but our goal is to learn something from this visualization. While this gives some information, like the outliers far to the eastern part of the county, it doesn’t give much else. We have to play with parameters. Let’s try splitting the data by price. These are the houses that are listed for less than $750,000.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 2,color = 'red',marker = 's',label = 'Price < 750k',aspect = 1.5)
plt.legend(prop = {'size':15} )
ax.set_title('Houses by Price in Kings County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
Now we graph the houses priced at or above $750,000.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 2,color = 'yellow',marker = 'v',label = 'Price >=750k', aspect = 1.5)
plt.legend(prop = {'size':15})
ax.set_title('Houses by Price in Kings County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
There is a big difference in terms of both location and quantity. But that is not the end: we can also layer them one on top of the other. We will plot the expensive houses on top of the cheaper ones because they are scarcer and would otherwise be hidden.
fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 1,color = 'red',marker = 's',label = 'Price <750k = Red', aspect = 1.5)
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 1,color = 'yellow',marker = 'v',label = 'Price>= 750k = Yellow',aspect = 1.5)
plt.legend(prop = {'size':12})
ax.set_title('Houses by Price in Kings County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})
The picture painted by this map is interesting. There is a plethora of housing in King County that falls below the bar we've set. Most of the houses on the lower end of the price scale sit farther inland than the more expensive ones.
If you zoom in, the more expensive houses dot the waterside. They also are more centrally located around the Seattle city center. There are several physical outliers but the trend is clear.
Overall, the visualization has done its job. We have made several determinations from the houses on the map. Pricier houses are clustered around the downtown area and spread along Puget Sound. They are also a minority in the data, which could be telling for predicting housing prices. The houses priced on the cheaper side are much more numerous and more widely spread. This will be useful for further EDA.
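The "minority" observation can also be checked numerically rather than by eye. The sketch below uses six invented prices in place of the real `price` column, so the counts it prints say nothing about the actual dataset; only the technique carries over.

```python
import pandas as pd

# Hypothetical prices standing in for the `price` column of the housing data
prices = pd.DataFrame({"price": [325000, 480000, 610000, 749999, 750000, 1250000]})

threshold = 750_000
is_expensive = prices["price"] >= threshold  # boolean mask, same split as the maps

n_expensive = int(is_expensive.sum())
n_cheaper = int((~is_expensive).sum())
share_expensive = is_expensive.mean()  # fraction of rows at or above the threshold

print(n_expensive, n_cheaper, round(share_expensive, 3))
```

Running the same mask against the real dataframe would give the class balance behind the two colors on the map.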
If you want to connect to talk more about this technique, you can find me on LinkedIn. If you would like to check out the code, take a look at my Github.
Sources
- King County Dataset: here
- King County Shape File: here
- GeoPandas
- Shapely
- Descartes
- Fiona
Original article: https://towardsdatascience.com/using-geopandas-for-spatial-visualization-21e78984dc37