Streaming Big Data Analytics

Recent years have seen a considerable rise in connected devices, such as IoT [1] devices, and in streaming sensor data. There are now billions of IoT devices connected to the internet, and while you read this article, terabytes to petabytes of IoT data will be generated across the world. This data contains a huge amount of information and insight. However, processing such high volumes of streaming data is a challenge, and it requires advanced big data capabilities to manage that volume and derive insights from the data.

At AlgoAnalytics, we have developed a powerful tool that ingests real-time streaming data feeds (for example, from IoT devices) to enable visualization and analytics for quicker business decisions.

Streaming Big Data Analytics involves four steps: data ingestion, real-time data processing, data storage, and visualization.

The high-level design of the Streaming Big Data Analytics pipeline is illustrated in Figure 1.

Figure 1: High-Level Design

1. Data Ingestion:

Data ingestion involves gathering data from various streaming sources (e.g. IoT sensors) and transporting it to a common data store. Essentially, it moves unstructured data from its origin to a system where it can be stored for further processing. Data arrives from different sources, in different formats, and at different speeds, so it is critical to ingest the complete data into the pipeline without any failure.

For data ingestion we have used Apache Kafka [2], a distributed messaging system that fulfills all of the above requirements. We have built a highly scalable, fault-tolerant multi-node Kafka cluster that can process thousands of messages per second without any data loss or downtime. A Kafka producer collects data from the various sources and publishes it to the corresponding topics; Kafka consumers then read the data from the topics they are interested in. In this way, data from different sources is ingested into the pipeline for processing.

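To make the ingestion step concrete, here is a minimal sketch of a Kafka producer written with the kafka-python client. The broker address, the "iot-sensors" topic name, and the message fields are illustrative assumptions for this example, not our actual configuration.

```python
# Minimal Kafka producer sketch (kafka-python). Broker address, topic name,
# and message schema are assumptions made for illustration only.
import json
import time

from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["localhost:9092"],    # assumed broker address
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
    acks="all",                              # wait for full acknowledgement to avoid data loss
)

def publish_reading(device_id: str, temperature: float) -> None:
    """Publish one sensor reading to the (hypothetical) 'iot-sensors' topic."""
    message = {
        "device_id": device_id,
        "temperature": temperature,
        "timestamp": time.time(),
    }
    producer.send("iot-sensors", value=message)

publish_reading("device-001", 23.7)
producer.flush()    # make sure buffered messages reach the brokers
```

On the other side, a consumer subscribes to the same topic; in our pipeline that consumer role is played by Spark Streaming, described in the next step.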

2. Real-Time Data Processing:

The data collected in the previous step needs to be processed in real time before it is pushed to any filesystem or database. This includes transforming unstructured data into structured data. Processing covers filtering, mapping, converting data types, removing unwanted data, deriving simplified data from complex data, and so on.

For this step we have used Spark Streaming [3], which works particularly well with Apache Kafka for building real-time applications. Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams. Spark Streaming receives the data ingested through Kafka and converts it into a continuous stream of RDDs called DStreams (the basic abstraction in Spark Streaming). Various Spark transformations are applied to these DStreams to bring the data into a form from which it can be pushed to the database.

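As a sketch of what this step can look like, the snippet below consumes the Kafka topic with the DStream API and the Kafka direct stream, parses the JSON payloads, filters out incomplete records, and maps them to a simplified structure. It targets the Spark 2.x Python API (the Python Kafka DStream integration was removed in Spark 3), and the topic, broker address, and field names are assumptions carried over from the ingestion example.

```python
# Spark Streaming (DStream) sketch against the Spark 2.x Python API.
# Topic, broker address, and field names are illustrative assumptions.
import json

from pyspark import SparkContext
from pyspark.streaming import StreamingContext
from pyspark.streaming.kafka import KafkaUtils

sc = SparkContext(appName="iot-stream-processing")
ssc = StreamingContext(sc, batchDuration=5)    # 5-second micro-batches

# Each record is a (key, value) pair; the value is the JSON string
# published by the Kafka producer in the ingestion step.
stream = KafkaUtils.createDirectStream(
    ssc, ["iot-sensors"], {"metadata.broker.list": "localhost:9092"}
)

readings = (
    stream.map(lambda kv: json.loads(kv[1]))                    # unstructured -> dict
          .filter(lambda r: r.get("temperature") is not None)   # drop unwanted records
          .map(lambda r: (r["device_id"], float(r["temperature"]), r["timestamp"]))
)

readings.pprint()    # in the real pipeline, an output operation writes to the database instead

ssc.start()
ssc.awaitTermination()
```

In production, an output operation such as foreachRDD would write each processed micro-batch to the time-series store described in the next step.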

3. Data Storage:

The data received from source devices (such as IoT devices) is time-series data: measurements or events that are tracked, monitored, downsampled, and aggregated over time. The properties that set time-series data apart from other data workloads are data lifecycle management, summarization, and large range scans over many records. A time series database (TSDB) [4] is a database optimized for such time-stamped or time-series data, with time as a key index, which makes it distinctly different from a relational database. A time-series database lets you store large volumes of time-stamped data in a format that allows fast insertion and fast retrieval, supporting complex analysis on that data.

InfluxDB [5] is one such time-series database, designed to handle these high write and query loads. We have set up a multi-node InfluxDB cluster that can handle millions of writes per second, and InfluxDB's in-memory indexing allows fast and efficient query results. We have also set up continuous tasks that downsample the data to lower precision, producing summarized data that can be kept for a longer period of time or indefinitely. Compared with retaining only very high-precision data, this reduces both the amount of data that needs to be stored and the query time many times over.

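As a sketch of the storage step, the snippet below writes one time-stamped point with the InfluxDB 1.x Python client and registers a continuous query that downsamples raw readings into hourly means. The database name, measurement, tags, and downsampling interval are illustrative assumptions, not our actual cluster configuration.

```python
# InfluxDB storage sketch (InfluxDB 1.x Python client). Database name,
# measurement, tags, and the downsampling policy are illustrative assumptions.
from influxdb import InfluxDBClient

client = InfluxDBClient(host="localhost", port=8086, database="iot_metrics")
client.create_database("iot_metrics")    # idempotent if the database already exists

# Write one time-stamped point; in the pipeline this is done from the
# Spark Streaming output operation for each processed record.
client.write_points([
    {
        "measurement": "sensor_readings",
        "tags": {"device_id": "device-001"},
        "fields": {"temperature": 23.7},
        "time": "2020-01-01T00:00:00Z",
    }
])

# Continuous query that downsamples raw readings into hourly means, so the
# summarized series can be kept long after the high-precision data expires.
client.query(
    "CREATE CONTINUOUS QUERY cq_hourly_temperature ON iot_metrics BEGIN "
    "SELECT mean(temperature) AS temperature "
    "INTO sensor_readings_hourly FROM sensor_readings "
    "GROUP BY time(1h), device_id END"
)
```

Pairing a short retention policy on the raw measurement with a longer one on the downsampled series is the usual way to keep storage bounded while the summarized data stays available for long-term analysis.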

4. Visualization:

To add value to the processed data, it is necessary to visualize it and establish relationships within it. Data visualization and analytics provide more control over the data and let us work with it efficiently.

We used Grafana [6], a multi-platform, open-source analytics and interactive visualization web application. It provides charts, graphs, and alerts for the web when connected to supported data sources. We have created multiple dashboards for different comparisons. On these dashboards we can visualize the real-time status as well as historical data (weeks, months, or even years), and we can compare data of the same type across different parameters. Several template variables are defined, which makes the dashboards flexible enough to serve multiple visualizations: for example, we can select a single device, multiple devices, or even all devices at a time, and we can choose how to aggregate the data, from per minute and per hour up to per year.

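To show how these template variables fit together, here is a sketch of the kind of templated InfluxQL query that sits behind a dashboard panel. The measurement and field names are assumptions; "$device" is assumed to be a multi-value dashboard variable, while "$timeFilter" and "$__interval" are Grafana's built-in macros for the selected time range and the aggregation window.

```python
# Sketch of a templated panel query and its expanded form. Measurement and
# field names are illustrative assumptions.
PANEL_QUERY = (
    'SELECT mean("temperature") AS "temperature" '
    'FROM "sensor_readings" '
    'WHERE ("device_id" =~ /^$device$/) AND $timeFilter '
    'GROUP BY time($__interval), "device_id" fill(null)'
)

# Roughly what Grafana sends to InfluxDB after expanding the variables for a
# single device over a two-month window, aggregated per hour:
EXPANDED_QUERY = (
    'SELECT mean("temperature") AS "temperature" '
    'FROM "sensor_readings" '
    'WHERE ("device_id" =~ /^device-001$/) AND time >= now() - 60d '
    'GROUP BY time(1h), "device_id" fill(null)'
)
```

With the multi-value option enabled on the $device variable, selecting several devices expands the regex to an alternation such as (device-001|device-002), which is what allows a single panel to show one device, several devices, or all devices at once.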

Figure 2: One of the dashboards in the IoT Analytics application

Figure 2 shows the uptime and some parameters of a selected machine over a selected period (2 months).

Applications:

As a large number of businesses in multiple sectors move to connected, smart devices, Streaming Big Data Analytics finds applications across many verticals.

A few examples include real-time machine monitoring and anomaly detection in industry; sensor-embedded medical devices that detect emergencies in advance; surveillance using video analytics; retail and logistics, where studying customer movements helps increase sales; the transport sector, with smart traffic control and electronic toll collection systems; military surveillance; and environmental monitoring of air quality, soil conditions, wildlife movement, and more.

For further information, please contact: info@algoanalytics.com

1. IoT: https://en.wikipedia.org/wiki/Internet_of_things
2. Apache Kafka: https://kafka.apache.org/documentation/#gettingStarted
3. Spark Streaming: https://spark.apache.org/docs/latest/streaming-programming-guide.html
4. Time Series Database: https://www.influxdata.com/time-series-database/
5. InfluxDB: https://www.influxdata.com/products/influxdb-overview/
6. Grafana: https://grafana.com/docs/

Originally published at: https://medium.com/algoanalytics/streaming-big-data-analytics-d4311ed20581
