Big Data Security Part One: Introducing PacketPig

Series Introduction

Packetloop CTO Michael Baker (@cloudjunky) made a big splash when he presented ‘Finding Needles in Haystacks (the Size of Countries)’ at Black Hat Europe earlier this year. The paper outlines Packetpig (@packetpig, available on GitHub), a toolkit based on Apache Pig for doing network security monitoring and intrusion detection analysis on full packet captures using Hadoop.

In this series of posts, we’re going to introduce Big Data Security and explore using Packetpig on real full packet captures to understand and analyze networks. In this post, Michael will introduce big data security in the form of full data capture, Packetpig and Packetloop.

Introducing Packetpig

Intrusion detection is the analysis of network traffic to detect intruders on your network. Most intrusion detection systems (IDS) look for signatures of known attacks and identify them in real time. Packetpig is different. Packetpig analyzes full packet captures – logs of every single packet sent across your network – after the fact. Because it runs Hadoop over full packet captures, Packetpig can detect ‘zero day’ or previously unknown exploits in historical data as new exploits are discovered. In other words, Packetpig can determine whether intruders are already in your network, how long they have been there, and what they have stolen or abused.
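The retrospective idea can be sketched in a few lines: when a new signature is published, replay it over stored traffic. The record layout and signature format below are illustrative stand-ins, not Packetpig’s actual schema.

```python
# When a new signature appears, replay it over historical traffic.
# Signatures here are simple byte patterns; real rules are far richer.
new_signatures = {"CVE-2099-0001": b"\xde\xad\xbe\xef"}  # hypothetical rule

historical_packets = [
    {"ts": "2012-01-03T10:00:00Z", "src": "203.0.113.7", "payload": b"GET / HTTP/1.1"},
    {"ts": "2012-02-14T02:12:09Z", "src": "198.51.100.9", "payload": b"\x00\xde\xad\xbe\xef\x00"},
]

def replay(packets, signatures):
    """Yield (signature_name, packet) for every historical hit on a new rule."""
    for pkt in packets:
        for name, pattern in signatures.items():
            if pattern in pkt["payload"]:
                yield name, pkt

hits = list(replay(historical_packets, new_signatures))
# One hit: the February packet matches the newly published signature.
```

The point is not the pattern match itself but that the raw payloads are still there to match against, months after the fact.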

Packetpig is a Network Security Monitoring (NSM) toolset where the ‘Big Data’ is full packet captures. Like a TiVo for your network, Packetpig performs deep packet inspection, file extraction, feature extraction, operating system detection, and other deep network analysis through its integration with Snort, p0f and custom Java loaders. Packetpig’s analysis of full packet captures focuses on giving analysts as much context as possible – context they have never had before. This is a ‘Big Data’ opportunity.

Full Packet Capture: A Big Data Opportunity

What makes full packet capture possible is cheap storage – the driving factor behind ‘big data.’ A standard 100Mbps internet connection can be cheaply logged for months with a 3TB disk. Apache Hadoop is optimized around cheap storage and data locality: putting spindles next to processor cores. So what better way to analyze full packet captures than with Apache Pig, a dataflow scripting interface on top of Hadoop?
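The storage arithmetic behind that claim is easy to check. A minimal sketch, assuming the link averages well below its 100Mbps ceiling (the 10Mbps sustained-utilization figure is our assumption, not the article’s):

```python
# Back-of-envelope retention estimate for full packet capture.
def retention_days(disk_tb, avg_mbps):
    """Days of capture a disk holds at a sustained average bit rate."""
    bytes_per_day = avg_mbps / 8 * 1e6 * 86400  # bits/s -> bytes/day
    return disk_tb * 1e12 / bytes_per_day

# A 3 TB disk on a link averaging 10 Mbps:
days = retention_days(3, 10)
print(round(days))  # -> 28, i.e. roughly a month of history
```

At lower average utilization the window stretches to several months, which is the regime the article describes.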

In the enterprise today, there is no single location or system to provide a comprehensive view of a network in terms of threats, sessions, protocols and files. This information is generally distributed across domain-specific systems such as IDS correlation engines and data stores, Netflow repositories, bandwidth optimisation systems or Data Loss Prevention tools. Security Information and Event Management (SIEM) systems offer to consolidate this information, but they operate on logs – a digest or snippet of the original information. They do not provide full-fidelity information that can be queried against an exact copy of the original incident.

Packet captures are a standard binary format for storing network data. They are cheap to perform, and the data can be stored in the cloud or on low-cost disk in the enterprise network. The retention period can be based on the amount of data flowing through the network each day and the window of time you want to be able to peer into the past.

Pig, Packetpig and Open Source Tools

In developing Packetpig, Packetloop wanted to provide free tools for the analysis of network packet captures spanning weeks, months or even years. Capturing and storing network data were solved problems, but no one had addressed the fundamental problem of analysis. Packetpig solves it by building on the Hadoop stack.

For us, wrapping Snort and p0f was a homage to how much security professionals value and rely on open source tools. We felt that if we didn’t offer an open source way of analysing full packet captures, we would have missed a real opportunity to pioneer in this area. We wanted it to be simple, turnkey and easy for people to take our work and expand on it. This is why Apache Pig was selected for the project.

Understanding your Network

One of the first data sets we were given to analyse was a 3TB capture from a customer – every packet in and out of their 100Mbps internet connection for six weeks. It contained approximately 500,000 attacks. Making sense of this volume of information is incredibly difficult with current tooling; even Network Security Monitoring (NSM) tools struggle with data at this scale. But it’s not just size and scale: no existing toolset provides the same level of context. Packetpig lets you join together information on threats, sessions, protocols (deep packet inspection) and files, as well as geolocation and operating system detection.

We are currently logging all packets for a website over six months. This data set is currently around 0.6TB, and because all the packet captures are stored in S3 we can quickly scan through it. More importantly, we can run a job nightly or every 15 minutes to correlate attack information with other data from Packetpig, providing the maximum amount of context around security events.
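A toy version of such a correlation pass joins Snort-style alerts against p0f-style OS fingerprints on source IP. The field names are illustrative, not Packetpig’s actual column names.

```python
# Enrich IDS alerts with OS fingerprints keyed on source IP,
# mimicking the kind of join a nightly Pig job would perform.
alerts = [
    {"src_ip": "198.51.100.9", "sig": "SQL injection attempt"},
    {"src_ip": "192.0.2.44", "sig": "Port scan"},
]
fingerprints = {
    "198.51.100.9": "Linux 2.6.x",
    "192.0.2.44": "Windows XP",
}

def enrich(alerts, fingerprints):
    """Left-join alerts with fingerprints; unknown IPs get 'unknown'."""
    return [dict(a, os=fingerprints.get(a["src_ip"], "unknown")) for a in alerts]

enriched = enrich(alerts, fingerprints)
```

In Packetpig itself this shape of work is expressed as a Pig JOIN across loaders rather than in-memory dictionaries, but the relational idea is the same.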

Items of interest include:

  • Detect anomalies and intrusion signatures
  • Learn the timeframe and identity of attackers
  • Triage incidents
  • “Show me packet captures I’ve never seen before.”

“Never before seen” is a powerful filter and isn’t limited to attack information. First introduced by Marcus Ranum, “never before seen” analysis can be used to rule out normal network behaviour and show only sources, attacks and traffic flows that are truly anomalous. For example, consider the outbound communications from a web server: which attacks, clients and outbound connections are new or have never been seen before? In an instant, instead of hunting through the normal, you are looking straight away at the abnormal or at signs of misuse.
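At its core, “never before seen” is a set difference against a historical baseline. The flow tuples below are hypothetical:

```python
# Flows observed historically on the web server: (src, dst, dst_port).
baseline = {
    ("10.0.0.5", "93.184.216.34", 443),
    ("10.0.0.5", "198.51.100.9", 80),
}

# Today's outbound flows; the second one is an IRC connection to a new host.
todays_flows = [
    ("10.0.0.5", "93.184.216.34", 443),
    ("10.0.0.5", "203.0.113.66", 6667),
]

def never_before_seen(observed, baseline):
    """Keep only flows absent from the historical baseline."""
    return [flow for flow in observed if flow not in baseline]

novel = never_before_seen(todays_flows, baseline)
# Only the IRC flow to 203.0.113.66 survives the filter.
```

The same anti-join applies equally to attack signatures or client IPs; the only thing that changes is the key you baseline on.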

Agile Data

Packetloop uses the stack and iterative prototyping techniques outlined in the forthcoming book by Hortonworks’ own Russell Jurney, Agile Data (O’Reilly, March 2013). We use Hadoop, Pig, MongoDB and Cassandra to explore datasets and to encode important information into d3 visualisations. We use all of these tools in our research before adding functionality to Packetloop; these prototypes become the palette our product is built from.
