Data Analysis Project: Warehouse Inventory

The code for this project can be found at my GitHub.


Introduction

The goal of this project was to analyse historic stock/inventory data to decide how much stock of each item a retailer should hold in the future. When deciding which historic data to use I came across a great blog post by Nhan Tran on Towards Data Science. This post provided me with the data set I was looking for. The best thing about this post was that it gave a brief outline of a project, but no code. This meant I had to write all the code myself from scratch (which is good practice), yet I could check my answers at each stage using the landmarks in Nhan Tran’s blog post.


I completed the project using Python in a Jupyter Notebook.


Technical Goal

To give (with 95% confidence) a lower and an upper bound for how many pairs of Men’s shoes, of each size, a shop in the United States should stock in any given month.

Setting Up the Notebook and Importing the Data

First I read through the project and decided which libraries I would need to use. I then imported those libraries into my notebook:


Note: I didn’t foresee using some of these libraries, and went back and added them as and when I needed them.
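The import cell itself appears only as an image in the original post; a plausible sketch, assuming the libraries used by the code later in the write-up, is:

```python
import pandas as pd               # data loading, cleaning and pivot tables
import numpy as np                # np.ceil / np.floor and np.where
import matplotlib.pyplot as plt   # figure sizing for the plots
import seaborn as sns             # heat map visualisation
from scipy import stats           # Student's t-distribution
```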

Next I downloaded the data set and imported it into my notebook using read_csv() from the pandas library:


sales = pd.read_csv("Al-Bundy_raw-data.csv")

Inspecting and Cleaning the Data

Before beginning the project I inspected the DataFrame.

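The inspection step is shown only as an image in the original; on a toy stand-in for the sales data (hypothetical rows and columns), a typical first look would be:

```python
import pandas as pd

# Hypothetical stand-in for the imported sales data
sales = pd.DataFrame({
    'InvoiceNo': [52389, 52390],
    'Gender': ['Male', 'Female'],
    'Country': ['United States', 'Germany'],
    'Size (US)': [9.5, 7.0],
})

print(sales.head())   # first rows of the DataFrame
print(sales.shape)    # (rows, columns)
```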

[Image: DataFrame containing all imported data.]

Each row in the data set represents a pair of shoes that was sold.

I then inspected the data types of each column to ensure they made sense.


[Image: The data type of each column.]

I then changed the 'Date' datatype from object to datetime using the following code:


sales['Date'] = pd.to_datetime(sales['Date'])
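On a small hypothetical example, the conversion behaves as follows:

```python
import pandas as pd

df = pd.DataFrame({'Date': ['1/1/2016', '2/15/2016']})
df['Date'] = pd.to_datetime(df['Date'])   # object -> datetime64[ns]
print(df['Date'].dtype)
```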

Based on the specific goal of this project, I then decided to drop any columns that would not be relevant to this analysis. I dropped the European and UK size equivalents as only one size type was needed, and since I was focusing on US stores in particular I decided to keep US sizes. I dropped the Date column as I already had the month and year data in separate columns and did not need to be more specific than that. I also dropped the Product ID, Invoice Number, Discount and Unit Price, as they were either irrelevant or their effect on stock levels was beyond the scope of this project. I achieved this using the following code:

sales_mini = sales.drop(['InvoiceNo', 'ProductID','Size (Europe)', 'Size (UK)', 'UnitPrice', 'Discount', 'Date'], axis=1)

This left me with a more streamlined and easier to read DataFrame to analyse.


[Image: New cleaned and prepped DataFrame.]

Data Analysis and Visualisation

I now decided to concentrate on an even smaller subset of the data. Although a smaller data set may sacrifice some accuracy, it may also gain accuracy by using only the most relevant data available.

To do this I first created a new DataFrame of only Male shoes, using the following code:


sales_mini_male = sales_mini[sales_mini['Gender'] == 'Male']

[Image: DataFrame showing only Male shoes and relevant columns.]

I then selected all of the rows where the shoes were sold in the United States in 2016, in order to make the data as relevant to our goal as possible. I achieved this with the following code:


male_2016 = sales_mini_male[sales_mini_male['Year']==2016]
male_us_2016 = male_2016[male_2016['Country']=='United States']

[Image: Data for Male shoes in the United States for 2016.]

The above DataFrame doesn’t make it clear which shoe sizes were the most frequent. To find out, I created a pivot table analysing how many shoes of each size were sold each month (since people may be more likely to buy shoes in some months than in others). I achieved this using the following code:

male_us_2016_by_month = pd.pivot_table(male_us_2016, values='Country', index=['Size (US)'], columns=['Month'], fill_value=0, aggfunc=len)

[Image: Pivot table showing how many Male shoes were sold, of each size, in each month of 2016, in the US.]
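The values='Country' / aggfunc=len combination simply counts rows per cell; a toy example with hypothetical data makes this clearer:

```python
import pandas as pd

# Hypothetical sales rows: each row is one pair sold
toy = pd.DataFrame({
    'Size (US)': [9.5, 9.5, 10.0, 9.5],
    'Month': [1, 1, 1, 2],
    'Country': ['United States'] * 4,
})

# values='Country' is just a non-null column to count; aggfunc=len counts rows
counts = pd.pivot_table(toy, values='Country', index=['Size (US)'],
                        columns=['Month'], fill_value=0, aggfunc=len)
print(counts)
```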

Although more useful, the above table is still difficult to read. To solve this problem I imported the seaborn library and displayed the pivot table as a heat map:

plt.figure(figsize=(16, 6))
male_us_2016_by_month_heatmap = sns.heatmap(male_us_2016_by_month, annot=True, fmt='g', cmap='Blues')

[Image: Heat map of sizes sold per month.]

This heat map indicates that demand for shoes across different sizes likely follows a normal distribution. This makes sense intuitively, as few people have extremely small or extremely large feet, but many have a shoe size somewhere in between. I then illustrated this even more clearly by plotting the total yearly demand for each size in a bar chart.

male_us_2016_by_month_with_total = pd.pivot_table(male_us_2016, values='Country', index=['Size (US)'], columns=['Month'], fill_value=0, margins=True, margins_name='Total', aggfunc=len)
male_us_2016_by_month_with_total_right = male_us_2016_by_month_with_total.iloc[:-1, :]
male_us_2016_by_month_with_total_right = male_us_2016_by_month_with_total_right.reset_index()
male_us_2016_total_by_size = male_us_2016_by_month_with_total_right[['Size (US)', 'Total']]
male_us_2016_by_size_plot = male_us_2016_total_by_size.plot.bar(x='Size (US)',y='Total', legend=False)
male_us_2016_by_size_plot.set_ylabel("Frequency")

[Image: Bar plot showing total demand for each shoe size (Male, US, 2016).]

Although this analysis would be useful to a retailer, it is just an overview of what happened in 2016.

Student’s t-test

To predict future levels of demand with a given degree of confidence, I performed a statistical test on the data. Since the remaining relevant data formed a small sample that closely resembled a normal distribution, I decided a Student’s t-test would be the most appropriate.

First I found the t-value to be used in a 2-tailed test for 95% confidence intervals (note: 0.05 had to be divided by 2 to get 0.025 because the test is 2-tailed, and 11 degrees of freedom were used as there are 12 months of data).

t_value = stats.t.ppf(1-0.025,11)

[Image: t-value for a 2-tailed test at 95% confidence.]
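For reference, the critical value can be checked directly from the t-distribution (11 degrees of freedom = 12 months − 1):

```python
from scipy import stats

t_value = stats.t.ppf(1 - 0.025, 11)   # two-tailed, 95% confidence, df = 11
print(round(t_value, 3))               # ≈ 2.201
```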

I then used this t-value to calculate the margin of error, and displayed it along with other useful aggregates using the following code:

# Exclude the 'Total' margin column so the statistics cover only the 12 months
monthly_counts = male_us_2016_by_month_with_total.drop(columns='Total')

male_us_2016_agg = pd.DataFrame()
male_us_2016_agg['Size (US)'] = monthly_counts.index          # sizes are the pivot table's index
male_us_2016_agg['Mean'] = monthly_counts.mean(axis=1).values
male_us_2016_agg['Standard Error'] = monthly_counts.sem(axis=1).values
male_us_2016_agg['Margin Error'] = male_us_2016_agg['Standard Error'] * t_value
male_us_2016_agg['95% CI Lower Bound'] = male_us_2016_agg['Mean'] - male_us_2016_agg['Margin Error']
male_us_2016_agg['95% CI Upper Bound'] = male_us_2016_agg['Mean'] + male_us_2016_agg['Margin Error']
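The mechanics of the interval are just mean ± t × SEM; a sketch with hypothetical monthly counts for one shoe size:

```python
import numpy as np
from scipy import stats

# Hypothetical sales of one size, one count per month
monthly = np.array([4, 6, 5, 7, 3, 5, 6, 4, 5, 7, 6, 5])

mean = monthly.mean()
sem = stats.sem(monthly)                              # standard error of the mean
t_value = stats.t.ppf(1 - 0.025, len(monthly) - 1)    # two-tailed 95%, df = 11
lower = mean - t_value * sem
upper = mean + t_value * sem
print(lower, upper)
```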

[Image: Table of t-test results.]

Retailer-Friendly Output

I then decided to re-present my data in an easier-to-understand format. I made sure Python did as much of the work for me as possible, to keep the code efficient and easy to replicate and scale.

conclusion = pd.DataFrame()
conclusion['Size (US)'] = male_us_2016_agg['Size (US)']
conclusion['Lower Bound'] = male_us_2016_agg['95% CI Lower Bound'].apply(np.ceil).astype(int)
conclusion['Upper Bound'] = male_us_2016_agg['95% CI Upper Bound'].apply(np.floor).astype(int)
bounds = ('Based on data from 2016, we would expect, with 95% confidence, to sell at least '
          + conclusion['Lower Bound'].astype(str) + ' pair(s), and up to '
          + conclusion['Upper Bound'].astype(str) + ' pair(s) of ')
conclusion['Conclusion'] = np.where(conclusion['Size (US)'] == 'Total',
                                    bounds + 'shoes in a US store each month.',
                                    bounds + 'size ' + conclusion['Size (US)'].astype(str) + ' shoes in a US store each month.')
pd.set_option('display.max_colwidth', 200)

[Image: Results of the t-test presented in a retailer-friendly format.]

Possible Further Analysis

The same analysis can be completed for different 'Gender', 'Country' and 'Year' values. In total this would produce 2 x 4 x 3 = 24 different sets of bounds, which could guide the retailer in each specific circumstance. Alternatively, if these results don't differ much, we could use that as a reason to use a larger data set. For example, if our bounds don't change much for each 'Year', we may want to use the 'Male' 'United States' data from all years to get a more accurate result.
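A hypothetical sketch of looping over every 'Gender' × 'Country' × 'Year' combination (a toy DataFrame stands in for sales_mini, and the country values are assumed):

```python
import itertools
import pandas as pd

# Toy stand-in for the cleaned sales data (hypothetical values)
sales_mini = pd.DataFrame({
    'Gender': ['Male', 'Female', 'Male', 'Male'],
    'Country': ['United States', 'United States', 'Germany', 'United States'],
    'Year': [2016, 2016, 2015, 2015],
    'Size (US)': [9.5, 7.0, 10.0, 9.0],
    'Month': [1, 2, 3, 4],
})

subsets = {}
for gender, country, year in itertools.product(
        sales_mini['Gender'].unique(),
        sales_mini['Country'].unique(),
        sales_mini['Year'].unique()):
    mask = ((sales_mini['Gender'] == gender)
            & (sales_mini['Country'] == country)
            & (sales_mini['Year'] == year))
    subsets[(gender, country, year)] = sales_mini[mask]
    # ...repeat the pivot-table and t-test steps on each subset...
```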

Thanks for reading and thanks to Nhan Tran for the original post that guided this project.


The code for this project can be found at my GitHub.


Translated from: https://medium.com/swlh/data-analysis-project-warehouse-inventory-be4b4fee881f
