Analyzing CitiBike Data: EDA

Data Science

CitiBike is New York City's famous bike rental company and the largest in the USA. CitiBike launched in May 2013 and has become an essential part of the transportation network. It makes commuting fun, efficient, and affordable, not to mention healthy and good for the environment.


I obtained the CitiBike trip data for June 2013 from Kaggle. I will walk you through the complete exploratory data analysis, answering questions such as:


  1. Where do CitiBikers ride?
  2. When do they ride?
  3. How far do they go?
  4. Which stations are most popular?
  5. What days of the week are most rides taken on?
  6. And many more

Key learning:

I have used many parameters to tweak the plotting functions of Matplotlib and Seaborn, so this should be a good read for learning them in practice.


Note:

This article is best viewed on a larger screen such as a tablet or desktop. If at any point you find something difficult to understand, the link to my Kaggle notebook is at the end of this article; you can drop your queries in its comment section.


Let's get started

Importing necessary libraries and reading data.


#importing necessary libraries
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

#setting plot style to seaborn
plt.style.use('seaborn')

#reading data
df = pd.read_csv('../input/citibike-system-data/201306-citibike-tripdata.csv')
df.head()

CitiBike dataset

Let’s get some more information on the data.


df.info()

#sum of missing values in each column
df.isna().sum()

We have a whopping 577,703 rows to crunch and 15 columns, plus quite a few missing values. Let's deal with the missing values first.


Handling missing values

Let's first look at the percentage of missing values, which will help us decide whether or not to drop them.


#calculating the percentage of missing values:
#sum of missing values in the column, divided by the total number of rows, multiplied by 100
data_loss1 = round((df['end station id'].isna().sum()/df.shape[0])*100)
data_loss2 = round((df['birth year'].isna().sum()/df.shape[0])*100)

print(data_loss1, '% of data loss if NaN rows of end station id, \nend station name, end station latitude and end station longitude dropped.\n')
print(data_loss2, '% of data loss if NaN rows of birth year dropped.')

We cannot afford to drop the rows with missing 'birth year' values. Hence, we drop the entire 'birth year' column, and drop the rows with missing values in 'end station id', 'end station name', 'end station latitude', and 'end station longitude'. Fortunately, the missing values in these four columns fall on exactly the same rows, so dropping NaN rows across all four columns still results in only 3% data loss.

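The claim that the four end-station columns are missing on exactly the same rows can be checked directly. Here is a minimal sketch of such a check on a hypothetical mini-DataFrame (the column names match the dataset; the values are made up):

```python
import numpy as np
import pandas as pd

#hypothetical mini-frame mimicking the four end-station columns
df = pd.DataFrame({
    'end station id':        [1.0, np.nan, 3.0, np.nan],
    'end station name':      ['A', np.nan, 'C', np.nan],
    'end station latitude':  [40.7, np.nan, 40.8, np.nan],
    'end station longitude': [-74.0, np.nan, -73.9, np.nan],
})

cols = ['end station id', 'end station name',
        'end station latitude', 'end station longitude']

#True only if every column is missing on exactly the same rows
masks = df[cols].isna()
aligned = masks.eq(masks[cols[0]], axis=0).all().all()
print(aligned)  # True
```

Running the same check on the full dataset is a cheap way to confirm the 3% figure before dropping rows.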

#dropping NaN values
rows_before_dropping = df.shape[0]

#drop entire birth year column
df.drop('birth year', axis=1, inplace=True)

#now left with end station id, end station name, end station latitude and end station longitude
#these four columns have missing values in exactly the same rows,
#so dropping NaN from all four columns will still result in only 3% data loss
df.dropna(axis=0, inplace=True)
rows_after_dropping = df.shape[0]

#total data loss
print('% of data lost: ', ((rows_before_dropping-rows_after_dropping)/rows_before_dropping)*100)

#checking for NaN
df.isna().sum()

Let's see what gender tells us about our data

#plotting total no. of males and females
splot = sns.countplot('gender', data=df)

#adding value above each bar: annotation
for p in splot.patches:
    #the bar value is nothing but the height of the bar
    an = splot.annotate(format(p.get_height(), '.2f'),
                        (p.get_x() + p.get_width() / 2., p.get_height()),
                        ha='center',
                        va='center',
                        xytext=(0, 10),
                        textcoords='offset points')
    an.set_size(20)  #text size

splot.axes.set_title("Gender distribution", fontsize=30)
splot.axes.set_xlabel("Gender", fontsize=20)
splot.axes.set_ylabel("Count", fontsize=20)

#adding x tick labels
splot.axes.set_xticklabels(['Unknown', 'Male', 'Female'])
plt.show()

We can see more male riders than female riders in New York City, but due to the large number of riders with unknown gender, we cannot reach any concrete conclusion. Filling in the unknown gender values is possible, but we will not do it, considering those riders chose not to disclose their gender.


Subscribers vs Customers

Subscribers are users who bought the annual pass, and customers are the ones who bought either a 24-hour pass or a 3-day pass. Let's see which the riders choose most.


user_type_count = df['usertype'].value_counts()
plt.pie(user_type_count.values,
        labels=user_type_count.index,
        autopct='%1.2f%%',
        textprops={'fontsize': 15})
plt.title('Subscribers vs Customers', fontsize=20)
plt.show()

We can see there are more yearly subscribers than 1-3 day customers. But the difference is not large; the company should focus on converting customers into subscribers with offers or sales.


How long do riders typically use the bike?

We have a column called 'tripduration', which gives the duration of each trip in seconds. First, we will convert it to minutes, then create bins to group the trips into 0–30 min, 30–60 min, 60–120 min, and 120 min and above. Then, let's plot a graph to see how long riders typically use the bike.

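Before applying this to the full dataset, here is a tiny, self-contained illustration of how pd.cut assigns values to bins (the durations are made up, and 240 stands in for the column maximum):

```python
import pandas as pd

#hypothetical trip durations in minutes
durations = pd.Series([12.0, 45.0, 90.0, 200.0])

#same bin edges as in the analysis, with 240 standing in for the maximum
bins = pd.cut(durations, [0, 30, 60, 120, 240])
print(bins.tolist())
```

Each value lands in the half-open interval whose upper edge it does not exceed, e.g. 12.0 falls into (0, 30].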

#converting trip duration from seconds to minutes
df['tripduration'] = df['tripduration']/60

#creating bins (0-30min, 30-60min, 60-120min, 120 and above)
max_limit = df['tripduration'].max()
df['tripduration_bins'] = pd.cut(df['tripduration'], [0, 30, 60, 120, max_limit])

sns.barplot(x='tripduration_bins', y='tripduration', data=df, estimator=np.size)
plt.title('Usual riding time', fontsize=30)
plt.xlabel('Trip duration group', fontsize=20)
plt.ylabel('Trip Duration', fontsize=20)
plt.show()

A large number of riders ride for less than half an hour per trip, and most ride for less than an hour.


Same start and end location vs different start and end location

We can see in the data that some trips start and end at the same location. Let's see how many.


#number of trips that started and ended at the same station
start_end_same = df[df['start station name'] == df['end station name']].shape[0]

#number of trips that started and ended at different stations
start_end_diff = df.shape[0] - start_end_same

plt.pie([start_end_same, start_end_diff],
        labels=['Same start and end location',
                'Different start and end location'],
        autopct='%1.2f%%',
        textprops={'fontsize': 15})
plt.title('Same start and end location vs Different start and end location', fontsize=20)
plt.show()

Riding pattern over the month

This is the part where I spent the most time and effort. The graph below says a lot, and technically there is a lot of code behind it. Before looking at the code, here is an overview of what we are doing: we are plotting a time series to see the trend in the number of rides taken per day, and the trend in the total duration the bikes were in use per day. Let's look at the code first; then I will break it down for you.


#converting string to datetime object
df['starttime'] = pd.to_datetime(df['starttime'])

#since we are dealing with a single month, we group by day,
#using count aggregation to get the number of occurrences, i.e. total trips per day
start_time_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).count()

#we have data for only one day of July, in the last row; let's drop it
start_time_count.drop(start_time_count.tail(1).index, axis=0, inplace=True)

#again grouping by day, aggregating with sum to get total trip duration per day,
#which will be used while plotting
trip_duration_count = df.set_index('starttime').groupby(pd.Grouper(freq='D')).sum()

#again dropping the last row for the same reason
trip_duration_count.drop(trip_duration_count.tail(1).index, axis=0, inplace=True)

#plotting total rides per day,
#using start station id to get the count
fig, ax = plt.subplots(figsize=(25,10))
ax.bar(start_time_count.index, 'start station id', data=start_time_count, label='Total riders')

#bbox_to_anchor positions the legend box
ax.legend(loc="lower left", bbox_to_anchor=(0.01, 0.89), fontsize='20')
ax.set_xlabel('Days of the month June 2013', fontsize=30)
ax.set_ylabel('Riders', fontsize=40)
ax.set_title('Bikers trend for the month June', fontsize=50)

#creating a twin x axis to plot a line chart in the same figure
ax2 = ax.twinx()

#plotting total trip duration of all users per day
ax2.plot('tripduration', data=trip_duration_count, color='y', label='Total trip duration', marker='o', linewidth=5, markersize=12)
ax2.set_ylabel('Time duration', fontsize=40)
ax2.legend(loc="upper left", bbox_to_anchor=(0.01, 0.9), fontsize='20')

ax.set_xticks(trip_duration_count.index)
ax.set_xticklabels([i for i in range(1,31)])

#tweaking x and y tick labels of axes 1
ax.tick_params(labelsize=30, labelcolor='#eb4034')
#tweaking x and y tick labels of axes 2
ax2.tick_params(labelsize=30, labelcolor='#eb4034')
plt.show()

You might have understood the basic idea from the comments, but let me explain the process step by step:


  1. The date-time is stored as a string; we convert it into a DateTime object.
  2. Group the data by day of the month and count the number of occurrences to plot rides per day.
  3. We have only one row with information for the month of July. This is an outlier; drop it.
  4. Repeat steps 2 and 3, the only difference being that this time we sum the data instead of counting, to get the total trip duration per day.
  5. Plot both series on a single graph using the twin-axis method.
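The counting and summing steps above can be sketched on a toy frame (hypothetical timestamps and durations) to make the pd.Grouper pattern concrete:

```python
import pandas as pd

#toy data: three trips across two days
ts = pd.DataFrame({
    'starttime': pd.to_datetime(['2013-06-01 08:00', '2013-06-01 09:30',
                                 '2013-06-02 18:15']),
    'tripduration': [10.0, 20.0, 30.0],
})

#trips per day (count) and total riding time per day (sum)
daily_counts = ts.set_index('starttime').groupby(pd.Grouper(freq='D')).count()
daily_sums = ts.set_index('starttime').groupby(pd.Grouper(freq='D')).sum()

print(daily_counts['tripduration'].tolist())  # [2, 1]
print(daily_sums['tripduration'].tolist())    # [30.0, 30.0]
```

The same two aggregations, applied to the full June data, produce the bar series and the line series in the figure.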

I have used a lot of tweaking methods from Matplotlib here; make sure to go through each of them. If you have any doubts, drop a comment on the Kaggle notebook, linked at the end of this article.



The number of riders increases considerably towards the end of the month. There are negligible riders on the first Sunday of the month. The amount of time bikers spend riding decreases towards the end of the month.


Top 10 start stations

This is pretty straightforward: we get the occurrences of each start station using value_counts(), slice to get the first 10 values, and then plot them.

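As a quick, made-up illustration of the value_counts() and slicing pattern described above:

```python
import pandas as pd

#hypothetical station names
stations = pd.Series(['A', 'B', 'A', 'C', 'A', 'B'])

#count occurrences of each station and keep the two most frequent
top2 = stations.value_counts()[:2]
print(top2.index.tolist())   # ['A', 'B']
print(top2.values.tolist())  # [3, 2]
```

value_counts() sorts in descending order of frequency, so slicing off the head gives the most popular stations directly.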

#top 10 start stations
top_start_station = df['start station name'].value_counts()[:10]

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(x=top_start_station.index, height=top_start_station.values, width=0.5)

#adding value above each bar: annotation
for p in ax.patches:
    an = ax.annotate(format(p.get_height(), '.2f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center',
                     va='center',
                     xytext=(0, 10),
                     textcoords='offset points')
    an.set_size(20)

ax.set_title("Top 10 start locations in NY", fontsize=30)
ax.set_xlabel("Station name", fontsize=20)

#rotating the x tick labels to 45 degrees
ax.set_xticklabels(top_start_station.index, rotation=45, ha="right")
ax.set_ylabel("Count", fontsize=20)

#tweaking x and y tick labels
ax.tick_params(labelsize=15)
plt.show()

Top 10 end stations

#top 10 end stations
top_end_station = df['end station name'].value_counts()[:10]

fig, ax = plt.subplots(figsize=(20,8))
ax.bar(x=top_end_station.index, height=top_end_station.values, color='#edde68', width=0.5)

#adding value above each bar: annotation
for p in ax.patches:
    an = ax.annotate(format(p.get_height(), '.2f'),
                     (p.get_x() + p.get_width() / 2., p.get_height()),
                     ha='center',
                     va='center',
                     xytext=(0, 10),
                     textcoords='offset points')
    an.set_size(20)

ax.set_title("Top 10 end locations in NY", fontsize=30)
ax.set_xlabel("Street name", fontsize=20)

#rotating the x tick labels to 45 degrees
ax.set_xticklabels(top_end_station.index, rotation=45, ha="right")
ax.set_ylabel("Count", fontsize=20)

#tweaking x and y tick labels
ax.tick_params(labelsize=15)
plt.show()

Here is the Kaggle notebook where I worked it all out. Feel free to drop queries in the comment section.


Translated from: https://medium.com/towards-artificial-intelligence/analyzing-citibike-data-eda-e657409f007a

