Spark入门（八）之WordCount

一、WordCount

计算文本里面的每个单词出现的个数，输出结果。

二、maven设置

<?xml version="1.0" encoding="UTF-8"?><project xmlns="http://maven.apache.org/POM/4.0.0" xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"xsi:schemaLocation="http://maven.apache.org/POM/4.0.0 http://maven.apache.org/xsd/maven-4.0.0.xsd"><modelVersion>4.0.0</modelVersion><groupId>com.mk</groupId><artifactId>spark-test</artifactId><version>1.0</version><name>spark-test</name><url>http://spark.mk.com</url><properties><project.build.sourceEncoding>UTF-8</project.build.sourceEncoding><maven.compiler.source>1.8</maven.compiler.source><maven.compiler.target>1.8</maven.compiler.target><scala.version>2.11.1</scala.version><spark.version>2.4.4</spark.version><hadoop.version>2.6.0</hadoop.version></properties><dependencies><!-- scala依赖--><dependency><groupId>org.scala-lang</groupId><artifactId>scala-library</artifactId><version>${scala.version}</version></dependency><!-- spark依赖--><dependency><groupId>org.apache.spark</groupId><artifactId>spark-core_2.11</artifactId><version>${spark.version}</version></dependency><dependency><groupId>org.apache.spark</groupId><artifactId>spark-sql_2.11</artifactId><version>${spark.version}</version></dependency><dependency><groupId>junit</groupId><artifactId>junit</artifactId><version>4.11</version><scope>test</scope></dependency></dependencies><build><pluginManagement><plugins><plugin><artifactId>maven-clean-plugin</artifactId><version>3.1.0</version></plugin><plugin><artifactId>maven-resources-plugin</artifactId><version>3.0.2</version></plugin><plugin><artifactId>maven-compiler-plugin</artifactId><version>3.8.0</version></plugin><plugin><artifactId>maven-surefire-plugin</artifactId><version>2.22.1</version></plugin><plugin><artifactId>maven-jar-plugin</artifactId><version>3.0.2</version></plugin></plugins></pluginManagement></build>
</project>

三、编程代码

public class WordCountApp implements SparkConfInfo{public static void main(String[]args){String filePath = "F:\\test\\log.txt";SparkSession sparkSession = new WordCountApp().getSparkConf("WordCount");Map<String, Integer> wordCountMap = sparkSession.sparkContext().textFile(filePath, 4).toJavaRDD().flatMap(v -> Arrays.asList(v.split("[(\\s+)(\r?\n),.。'’]")).iterator()).filter(v -> v.matches("[a-zA-Z-]+")).map(String::toLowerCase).mapToPair(v -> new Tuple2<>(v, 1)).reduceByKey(Integer::sum).collectAsMap();wordCountMap.forEach((k, v) -> System.out.println(k + ":" + v));sparkSession.stop();}
}public interface SparkConfInfo {default SparkSession getSparkConf(String appName){SparkConf sparkConf = new SparkConf();if(System.getProperty("os.name").toLowerCase().contains("win")) {sparkConf.setMaster("local[4]");System.out.println("使用本地模拟是spark");}else{sparkConf.setMaster("spark://hadoop01:7077,hadoop02:7077,hadoop03:7077");sparkConf.set("spark.driver.host","192.168.150.1");//本地ip，必须与spark集群能够相互访问，如：同一个局域网sparkConf.setJars(new String[] {".\\out\\artifacts\\spark_test\\spark-test.jar"});//项目构建生成的路径}SparkSession session = SparkSession.builder().appName(appName).config(sparkConf).config(sparkConf).getOrCreate();return session;}
}

文件内容

Spark Streaming is an extension of the core Spark API that enables scalable,high-throughput, fault-tolerant stream processing of live 。data streams. Data， can be ，ingested from many sources like Kafka, Flume, Kinesis, or TCP sockets, and can be processed using complex algorithms expressed with high-level functions like map, reduce, join and window. Finally, processed data can be pushed out to filesystems,Spark Streaming provides a high-level abstraction called discretized stream or DStream, which represents a continuous stream of data. DStreams can be created either from input data streams from sources such as Kafka, Flume, and Kinesis, or by applying high-level operations on other DStreams. Internally, a DStream is represented as a sequence of RDDs.This guide shows you how to start writing Spark Streaming programs with DStreams. You can write Spark Streaming programs in Scala, Java or Python (introduced in Spark 1.2), all of which are presented in this guide. You will find tabs throughout this guide that let you choose between code snippets of different languages. databases, and live dashboards. In fact, you can apply Spark’s machine learning and graph processing algorithms on data streams.

输出

created:1
continuous:1
high-level:3
either:1
reduce:1
many:1
writing:1
learning:1
sources:2
is:2
spark:7
can:6
high-throughput:1
filesystems:1
using:1
rdds:1
of:6
input:1
scala:1
you:5
operations:1
kinesis:2
fact:1
or:4
provides:1
pushed:1
how:1
will:1
join:1
databases:1
window:1
be:4
data:5
from:3
s:1
abstraction:1
languages:1
to:2
all:1
and:5
that:2
fault-tolerant:1
core:1
a:4
expressed:1
internally:1
streaming:4
on:2
dashboards:1
java:1
let:1
processed:2
with:2
write:1
by:1
between:1
in:4
live:2
like:2
represented:1
code:1
are:1
stream:3
algorithms:2
sequence:1
streams:3
graph:1
an:1
flume:2
apply:1
kafka:2
sockets:1
the:1
out:1
presented:1
snippets:1
extension:1
scalable:1
guide:3
dstream:2
choose:1
represents:1
dstreams:3
find:1
shows:1
programs:2
such:1
functions:1
called:1
tcp:1
machine:1
api:1
throughout:1
tabs:1
start:1
which:2
this:3
different:1
processing:2
applying:1
enables:1
complex:1
finally:1
introduced:1
python:1
other:1
discretized:1
map:1
as:2

本文来自互联网用户投稿，该文观点仅代表作者本人，不代表本站立场。本站仅提供信息存储空间服务，不拥有所有权，不承担相关法律责任。如若转载，请注明出处：http://www.mzph.cn/news/322468.shtml

如若内容造成侵权/违法违规/事实不符，请联系多彩编程网进行投诉反馈email:809451989@qq.com，一经查实，立即删除！