Using GeoPandas for Spatial Visualization

Recently, I was working on a project where I was trying to build a model that could predict housing prices in King County, Washington — the area that surrounds Seattle. After looking at the features, I wanted a way to determine the houses’ worth based on location.

The dataset included latitude and longitude, and it was easy to google them to take a look at the houses, their neighborhoods, their distance from the water, etc. But with over 17,000 observations, that was a fool’s errand. I had to find an easier way.

I had used Geographic Information Systems (GIS) only once before, and not in Python. So I did what I do best: I googled, and ran into this amazing package called GeoPandas. I am going to let the GeoPandas team sum up what they do because they can say it much better than I can.

GeoPandas is an open source project to make working with geospatial data in python easier. GeoPandas extends the datatypes used by pandas to allow spatial operations on geometric types. Geometric operations are performed by shapely. GeoPandas further depends on fiona for file access and descartes and matplotlib for plotting. — Description from GeoPandas Website (2020)

This blew my mind, and what I wanted was really just the most basic of the features. I am going to show you how to run this code and do what I did — plotting accurate points on a map.

You are going to need several packages and some files in addition to the basic pandas and matplotlib. They include:

  • geopandas: the package that makes all of this possible

  • shapely: a package for manipulation and analysis of planar geometric objects

  • descartes: provides a nicer integration of Shapely geometry objects with Matplotlib. It’s not needed every time but I import it just to be safe

  • Any .shp file: this is going to be the backdrop of the plot. Mine is going to have King County, but you should be able to find one from any city’s data department. Don’t delete any files from the .zip file it comes in. Something always breaks.

More information about shapefiles can be found here, but the long and short of it is that these aren’t normal images. They are a vector data storage format that has information linking to locations — coordinates and the rest.
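
Because a shapefile is really a bundle of sibling files, a quick check before loading can save some confusion. The sketch below is only an illustration: `missing_sidecars` is a hypothetical helper (not part of GeoPandas), and it assumes the three components the format treats as mandatory (.shp, .shx, .dbf) plus the optional .prj file that stores the coordinate system.

```python
from pathlib import Path

# Mandatory shapefile components, plus the optional projection file.
REQUIRED = (".shp", ".shx", ".dbf")
OPTIONAL = (".prj",)


def missing_sidecars(shp_path):
    """Return the sidecar extensions that are absent next to `shp_path`.

    Hypothetical helper for illustration; it only checks file names on disk.
    """
    base = Path(shp_path)
    return [ext for ext in REQUIRED + OPTIONAL
            if not base.with_suffix(ext).exists()]
```

An empty list means every expected file is sitting next to the .shp file; anything else names the pieces that went missing when the .zip was unpacked.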

First I imported the basic packages that I needed and then the new packages:

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from shapely.geometry import Point, Polygon
import geopandas as gpd
import descartes  # not needed every time; imported just to be safe

The Point and Polygon features are what help me match my data to the map I make.
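
As a small illustration of what these two classes do, here is a sketch with made-up coordinates (the square below is not a real district). Note that shapely expects x first, so longitude comes before latitude.

```python
from shapely.geometry import Point, Polygon

# A made-up square "district" around Seattle, as (longitude, latitude) pairs.
district = Polygon([(-122.5, 47.5), (-122.2, 47.5),
                    (-122.2, 47.8), (-122.5, 47.8)])

# shapely uses (x, y) order, which means (longitude, latitude).
house = Point(-122.33, 47.61)
far_away = Point(-121.0, 47.61)

print(district.contains(house))     # True: the point lies inside the square
print(district.contains(far_away))  # False: the point lies east of it
```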

Next, I load in my data. This is basic pandas, but for those who are new: the string in quotation marks is the name of the file that holds the housing records.

df = pd.read_csv('kc_house_data_train.csv')

With all of the packages imported and the data ready to go, I wanted to take a look at the map I was going to be plotting on. I did this by finding a shape file published on the King County government website. They have done all the hard work of surveying and cataloging the land, and it would be rude not to use their freely offered services. Loading the shape file is easy and comparable to loading a csv file with pandas.

kings_county = gpd.read_file('*file_path_here*/School_Districts_in_King_County___schdst_area.shp')

You can open this up if you want to take a look at the data. The King County shape file was just a dataframe of locations matched with their school districts, geometry coordinates, and area. But the best part is when we plot it and yes, we have to plot it. This isn’t an image you can just call — it will have the coordinates built in so our data can be placed down like a point on a 5th grade (x,y) graph.
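
To get a feel for what "coordinates built in" means, here is a toy GeoDataFrame built by hand rather than read from the King County file. The district names and squares are invented for illustration; the attributes used (`crs`, `geom_type`, `total_bounds`) are standard GeoPandas ones.

```python
import geopandas as gpd
from shapely.geometry import Polygon

# Two invented square "districts" in longitude/latitude degrees.
toy = gpd.GeoDataFrame(
    {"NAME": ["West", "East"]},
    geometry=[
        Polygon([(-122.4, 47.4), (-122.2, 47.4), (-122.2, 47.6), (-122.4, 47.6)]),
        Polygon([(-122.2, 47.4), (-122.0, 47.4), (-122.0, 47.6), (-122.2, 47.6)]),
    ],
    crs="EPSG:4326",  # plain longitude/latitude (WGS 84)
)

print(toy.crs)                 # the coordinate reference system
print(toy.geom_type.tolist())  # geometry type of each row
print(toy.total_bounds)        # [minx, miny, maxx, maxy] of the whole layer
```

Because the bounds are already in degrees, anything else expressed in longitude and latitude can be drawn on the same axes without conversion.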

I plotted it with the code below (notice that I edited it the same way I would edit any other graph):

fig, ax = plt.subplots(figsize = (15,15))
kings_county.plot(ax=ax)
ax.set_title('King County',fontdict = {'fontsize': 30})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

My output looked like this:

[Figure: plot of the King County shape file. Graphic by Author]

Before we start adding our housing data we should look at utilizing the shape file to the fullest. Let’s take a look at the file.

    OID  D#   NAME            geometry
0   1    1    Seattle         MULTIPOLYGON (((-122.40324 47.66637...
1   2    210  Federal Way     POLYGON ((-122.29057 47.39374...
2   3    216  Enumclaw        POLYGON ((-121.84898 47.34708...
3   4    400  Mercer Island   POLYGON ((-122.24475 47.59601...
4   5    401  Highline        POLYGON ((-122.35853 47.51553...
(Truncated for clarity)

As you can see, the county is divided into school districts, each with a shape that serves as its boundary. We will now try to plot the shape file and annotate the districts using the data provided, like so:

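# Names must match the NAME column exactly (e.g. 'Federal Way', with the space).
left = ['Riverview', 'Snoqualmie Valley']
center = ['Skykomish', 'Kent', 'Auburn', 'Tahoma', 'Vashon Island', 'Northshore', 'Shoreline', 'Renton', 'Highline', 'Issaquah', 'Enumclaw', 'Seattle', 'Federal Way', 'Bellevue', 'Mercer Island', 'Lake Washington', 'Tukwila']
right = ['Fife']
# Assumption: 'coords' is not defined in the code shown, so it is built here
# from a point that is guaranteed to lie inside each district polygon.
kings_county['coords'] = kings_county['geometry'].apply(lambda geom: geom.representative_point().coords[0])
kings_county.plot(figsize = (15,15), cmap = 'gist_earth')
for idx, row in kings_county.iterrows():
    # The label is passed positionally because older and newer matplotlib
    # versions name this argument differently (s vs. text).
    if row['NAME'] in left:
        plt.annotate(row['NAME'], xy=row['coords'], ha='left', color='red')
    elif row['NAME'] in center:
        plt.annotate(row['NAME'], xy=row['coords'], ha='center', color='red')
    elif row['NAME'] in right:
        plt.annotate(row['NAME'], xy=row['coords'], ha='right', color='red')
plt.title('School Districts in King County, WA', fontdict = {'fontsize': 20})
plt.ylabel('Latitude', fontdict = {'fontsize': 20})
plt.xlabel('Longitude', fontdict = {'fontsize': 20})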

The lists — left, right, center — are from trial and error with the placement of the district names. Some overlapped or needed to be manipulated so that they did not stray too far from their actual district.

I’ve changed the color map to gist_earth for clarity. Next, I iterated through each row, using the entry in the NAME series as the label and placing it at a point that was definitely inside the polygon. I aligned the names based on the lists I had made earlier. And this was our output:
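
The "definitely inside the polygon" part matters because a centroid can fall outside an oddly shaped district. Here is a minimal sketch with an invented U-shaped polygon, using shapely's `centroid` and `representative_point()`:

```python
from shapely.geometry import Polygon

# An invented U-shaped polygon: a 3x3 square with a notch cut from the top.
u_shape = Polygon([(0, 0), (3, 0), (3, 3), (2, 3),
                   (2, 1), (1, 1), (1, 3), (0, 3)])

centroid = u_shape.centroid
label_point = u_shape.representative_point()

print(u_shape.contains(centroid))     # False: the centroid sits in the notch
print(u_shape.contains(label_point))  # True: guaranteed to be inside
```

For label placement on districts with bays or islands, the second option is the safer anchor.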

[Figure: School Districts of King County. Graphic by Author]

Each of the regions signifies a school district in King County. This matches the data I found about the twenty school districts in the county. I never really thought about the size and shape of a county, so I googled it just to be sure.

[Figure: Washington State with King County highlighted. Source: Google Maps]

It seemed like the Google Maps image was the perfect hole for my puzzle piece. From here, it was just a matter of formatting my data to fit the shape file. I did that by defining my coordinate system and creating points from the latitude and longitude of my houses.

crs = 'EPSG:4326' # defining my coordinate system (the older {'init': 'epsg:4326'} form is deprecated)
geometry = [Point(x, y) for x, y in zip(df.long, df.lat)] # creating points (x = longitude, y = latitude)

If you were to look at an entry in geometry, all you would get back is a shapely Point object. The points still need to be attached to our original dataframe. Below, I make a brand new dataframe that combines the old dataframe, the coordinate system, and the points created from the longitude and latitude of each house.

geo_df = gpd.GeoDataFrame(df,                   # the dataframe
                          crs = crs,            # coordinate system
                          geometry = geometry)  # geometric points
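
For reference, GeoPandas also ships a helper, `points_from_xy`, that builds the same kind of point geometry without an explicit loop. This sketch uses three made-up rows in place of the real file; the column names `long`, `lat`, and `price` follow the ones used in this article.

```python
import pandas as pd
import geopandas as gpd

# Three made-up rows standing in for the housing data.
sample = pd.DataFrame({
    "price": [450000, 820000, 1200000],
    "long": [-122.33, -122.20, -122.05],
    "lat": [47.61, 47.55, 47.70],
})

# x is longitude and y is latitude, the same order as Point(x, y).
sample_geo = gpd.GeoDataFrame(
    sample,
    geometry=gpd.points_from_xy(sample["long"], sample["lat"]),
    crs="EPSG:4326",
)

print(sample_geo.geometry.iloc[0])  # the first house as a shapely Point
```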

That was the last step before we can plot the houses. Now, we put it all together.

fig, ax = plt.subplots(figsize = (15,16))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df.plot(ax = ax , markersize = 2, color = 'blue',marker ='o',label = 'House', aspect = 1)
plt.legend(prop = {'size':10} )
ax.set_title('Houses in King County, WA', fontdict = {'fontsize':20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

In the code above, the steps include:

  1. Creating a figure and axes to plot on.

  2. Plotting the King County shape file.

  3. Plotting the data I made that includes the geometry point.

    This includes making markers, choosing the aspect, and adding the label for the legend.

  4. Adding a legend, title, and axis labels.

These steps were done for each of the graphs.
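
On choosing the aspect: with raw longitude and latitude on the axes, a degree of longitude covers less ground than a degree of latitude once you move away from the equator, by a factor of roughly the cosine of the latitude. The sketch below shows where a value near 1.5 comes from. The latitude of 47.5 is my rough assumption for the middle of King County, and the function name is made up for this example.

```python
import math


def lonlat_aspect(latitude_deg):
    """Approximate y/x stretch that makes lon/lat plots look undistorted.

    One degree of longitude spans about cos(latitude) times the distance of
    one degree of latitude, so the y axis is stretched by 1 / cos(latitude).
    """
    return 1 / math.cos(math.radians(latitude_deg))


print(round(lonlat_aspect(47.5), 2))  # 1.48
print(round(lonlat_aspect(0.0), 2))   # 1.0
```

That lands close to the aspect of 1.5 used in the price plots, while an aspect of 1 draws degrees as squares and makes the county look wider than it is.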

Our output:

[Figure: Houses in King County, WA, plotted over the shape file]

This is a great product but our goal is to learn something from this visualization. While this gives some information, like the outliers in the far eastern part of the county, it doesn’t give much else. We have to play with the parameters. Let’s try splitting the data by price. These are the houses that are listed for less than $750,000.

fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 2,color = 'red',marker = 's',label = 'Price < 750k',aspect = 1.5)
plt.legend(prop = {'size':15} )
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

[Figure: Houses priced below $750,000. Graphic by Author]
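
The split used for this plot is ordinary pandas boolean indexing, which works the same way on a GeoDataFrame. Here is a sketch with invented prices, checking that the "below" and "at or above" masks cover every row exactly once:

```python
import pandas as pd

# Invented prices standing in for the real 'price' column.
prices = pd.DataFrame({"price": [300000, 749999, 750000, 1250000]})

threshold = 750000
cheaper = prices[prices["price"] < threshold]
pricier = prices[prices["price"] >= threshold]

print(len(cheaper), len(pricier))  # 2 2
# Every row falls in exactly one group, so no house is plotted twice.
assert len(cheaper) + len(pricier) == len(prices)
```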

Now we graph the houses priced at $750,000 or more.

fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 2,color = 'yellow',marker = 'v',label = 'Price >=750k', aspect = 1.5)
plt.legend(prop = {'size':15})
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

[Figure: Houses priced at or above $750,000. Graphic by Author]

There is a big difference in terms of both location and quantity. But that is not the end: we can also layer them one on top of the other. We will plot the expensive houses on top of the cheap ones because they are scarcer.

fig, ax = plt.subplots(figsize = (15,25))
kings_county.plot(ax=ax, alpha = 0.8, color = 'black')
geo_df[geo_df['price'] < 750000].plot(ax = ax , markersize = 1,color = 'red',marker = 's',label = 'Price <750k = Red', aspect = 1.5)
geo_df[geo_df['price'] >= 750000].plot(ax = ax , markersize = 1,color = 'yellow',marker = 'v',label = 'Price>= 750k = Yellow',aspect = 1.5)
plt.legend(prop = {'size':12})
ax.set_title('Houses by Price in King County, WA', fontdict ={'fontsize': 20})
ax.set_ylabel('Latitude',fontdict = {'fontsize': 20})
ax.set_xlabel('Longitude',fontdict = {'fontsize': 20})

[Figure: Side by side comparison of the two price groups. Graphic by Author]

The picture painted by this map is interesting. There is a plethora of housing in King County that falls below the bar we’ve set. Most of the houses on the lower end of the price scale fall farther inland than the more expensive ones.

If you zoom in, the more expensive houses dot the waterside. They also are more centrally located around the Seattle city center. There are several physical outliers but the trend is clear.

Overall, the visualization has done its job. We have made several determinations from the houses on the map. Pricier houses are clustered around the downtown area and spread along Puget Sound. They are also a minority in the data, which could be telling for predicting housing prices. The houses on the cheaper side are much more numerous and more varied in location. This will be useful for further EDA.

If you want to connect to talk more about this technique, you can find me on LinkedIn. If you would like to check out the code, take a look at my GitHub.

Sources

  • King County Dataset: here

  • King County Shape File: here

  • Geopandas

  • Shapely

  • Descartes

  • Fiona

Translated from: https://towardsdatascience.com/using-geopandas-for-spatial-visualization-21e78984dc37
