Table of Contents
0. Related Article Links
1. RDD Queue
1.1. Usage and Description
1.2. Hands-on Example
2. Custom Data Source
2.1. Usage and Description
2.2. Hands-on Example
3. Kafka Data Source
3.1. Version Selection
3.2. Kafka 0-8 Receiver Mode (not applicable to the current 3.x versions)
3.3. Kafka 0-8 Direct Mode (not applicable to the current 3.x versions)
3.4. Kafka 0-10 Direct Mode (the mode to use in 3.x versions)
0. Related Article Links
Spark Article Index
1. RDD Queue
1.1. Usage and Description
During testing, a DStream can be created with ssc.queueStream(queueOfRDDs); every RDD pushed into the queue is processed as one batch of that DStream.
1.2. Hands-on Example
- Requirement: create several RDDs in a loop and push them into a queue; create a DStream from that queue with Spark Streaming and compute a WordCount.
- Code:
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

import scala.collection.mutable

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))

        //3. Create the RDD queue
        val rddQueue: mutable.Queue[RDD[Int]] = new mutable.Queue[RDD[Int]]()

        //4. Create the QueueInputDStream (oneAtATime = false drains every RDD currently in the queue per batch)
        val inputStream: InputDStream[Int] = ssc.queueStream(rddQueue, oneAtATime = false)

        //5. Process the RDDs in the queue
        val mappedStream: DStream[(Int, Int)] = inputStream.map(((_: Int), 1))
        val reducedStream: DStream[(Int, Int)] = mappedStream.reduceByKey((_: Int) + _)

        //6. Print the result
        reducedStream.print()

        //7. Start the job
        ssc.start()

        //8. Create RDDs in a loop and push them into the queue
        for (i <- 1 to 5) {
            rddQueue += ssc.sparkContext.makeRDD(1 to 300, 10)
            Thread.sleep(2000)
        }

        ssc.awaitTermination()
    }
}
- Output:
-------------------------------------------
Time: 1689147795000 ms
-------------------------------------------
(84,1)
(96,1)
(120,1)
(180,1)
(276,1)
(156,1)
(216,1)
(300,1)
(48,1)
(240,1)
...
-------------------------------------------
Time: 1689147798000 ms
-------------------------------------------
(84,2)
(96,2)
(120,2)
(180,2)
(276,2)
(156,2)
(216,2)
(300,2)
(48,2)
(240,2)
...
-------------------------------------------
Time: 1689147801000 ms
-------------------------------------------
(84,1)
(96,1)
(120,1)
(180,1)
(276,1)
(156,1)
(216,1)
(300,1)
(48,1)
(240,1)
...
-------------------------------------------
Time: 1689147804000 ms
-------------------------------------------
(84,1)
(96,1)
(120,1)
(180,1)
(276,1)
(156,1)
(216,1)
(300,1)
(48,1)
(240,1)
...
-------------------------------------------
Time: 1689147807000 ms
-------------------------------------------

-------------------------------------------
Time: 1689147810000 ms
-------------------------------------------
2. Custom Data Source
2.1. Usage and Description
To collect data from a custom source, extend Receiver and implement the onStart and onStop methods.
2.2. Hands-on Example
- Requirement: implement a custom data source that monitors a given port and reads whatever is sent to it.
- The custom data source:
import org.apache.spark.storage.StorageLevel
import org.apache.spark.streaming.receiver.Receiver
import java.io.{BufferedReader, InputStreamReader}
import java.net.Socket
import java.nio.charset.StandardCharsets

class CustomerReceiver(host: String, port: Int) extends Receiver[String](StorageLevel.MEMORY_ONLY) {

    //Called when the receiver starts; it reads data and hands it over to Spark
    override def onStart(): Unit = {
        new Thread("Socket Receiver") {
            override def run() {
                receive()
            }
        }.start()
    }

    //Read data and hand it over to Spark
    def receive(): Unit = {

        //Open a socket to the monitored port
        val socket: Socket = new Socket(host, port)

        //Variable that holds the data received from the port
        var input: String = null

        //Create a BufferedReader to read the data sent to the port
        val reader: BufferedReader = new BufferedReader(new InputStreamReader(socket.getInputStream, StandardCharsets.UTF_8))

        //Read the first line
        input = reader.readLine()

        //While the receiver is not stopped and the input is not null, keep storing the data into Spark
        while (!isStopped() && input != null) {
            store(input)
            input = reader.readLine()
        }

        //Release the resources once the loop exits
        reader.close()
        socket.close()

        //Restart the receiver so it reconnects
        restart("restart")
    }

    override def onStop(): Unit = {}
}
- Collect data with the custom data source:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, ReceiverInputDStream}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))

        //3. Create a stream backed by the custom receiver
        val lineStream: ReceiverInputDStream[String] = ssc.receiverStream(new CustomerReceiver("localhost", 9999))

        //4. Split each line into words
        val wordStream: DStream[String] = lineStream.flatMap(_.split(" "))

        //5. Map every word to a tuple (word, 1)
        val wordAndOneStream: DStream[(String, Int)] = wordStream.map((_, 1))

        //6. Sum the counts of identical words
        val wordAndCountStream: DStream[(String, Int)] = wordAndOneStream.reduceByKey(_ + _)

        //7. Print the result
        wordAndCountStream.print()

        //8. Start the StreamingContext
        ssc.start()
        ssc.awaitTermination()
    }
}
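- To try this out locally, one simple option (not part of the original walkthrough) is to open a socket server on port 9999 with netcat and type a few words into it while the job is running:
nc -lk 9999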
- Output:
-------------------------------------------
Time: 1689148212000 ms
-------------------------------------------

-------------------------------------------
Time: 1689148215000 ms
-------------------------------------------
(abc,2)
(hello,1)
-------------------------------------------
Time: 1689148218000 ms
-------------------------------------------
3. Kafka Data Source
3.1. Version Selection
- ReceiverAPI: a dedicated Executor receives the data and then forwards it to the other Executors for computation. The problem is that the receiving Executor and the computing Executors work at different speeds; in particular, when the receiving Executor is faster than the computing Executors, the computing nodes may run out of memory. This approach was offered in early versions and is not applicable to the current version.
- DirectAPI: the computing Executors actively consume the data from Kafka themselves, so the consumption rate is under their own control (a small rate-limiting sketch follows below).
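The following is a minimal sketch (my own addition, not from the article) of how that rate can be bounded in Direct mode; the two property names are standard Spark settings, while the values are placeholders:
import org.apache.spark.SparkConf

object RateControlSketch {

    //Cap the pull rate of the Direct API and let backpressure adapt it to the processing speed
    val sparkConf: SparkConf = new SparkConf()
        .setMaster("local[*]")
        .setAppName("RateControlSketch")
        //Dynamically adjust the ingestion rate to match the processing rate
        .set("spark.streaming.backpressure.enabled", "true")
        //Hard upper bound: at most 1000 records per second per Kafka partition (placeholder value)
        .set("spark.streaming.kafka.maxRatePerPartition", "1000")
}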
3.2. Kafka 0-8 Receiver Mode (not applicable to the current 3.x versions)
- Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- Add the dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
- Code:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.ReceiverInputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))

        //3. Read data from Kafka and create a DStream (Receiver-based approach, addressed via ZooKeeper)
        val kafkaDStream: ReceiverInputDStream[(String, String)] = KafkaUtils.createStream(
            ssc,
            "bigdata01:2181,bigdata02:2181,bigdata03:2181",
            "StreamTest",
            Map[String, Int]("test" -> 1))

        //4. Compute the WordCount
        kafkaDStream.map {
            case (_, value) => (value, 1)
        }.reduceByKey(_ + _).print()

        //5. Start the job
        ssc.start()
        ssc.awaitTermination()
    }
}
3.3. Kafka 0-8 Direct Mode (not applicable to the current 3.x versions)
- Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- Add the dependency:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-8_2.11</artifactId>
    <version>2.4.5</version>
</dependency>
- Code (offsets maintained automatically):
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.InputDStream
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext and set the checkpoint directory
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
        ssc.checkpoint("./checkpoint")

        //3. Define the Kafka parameters
        val kafkaPara: Map[String, String] = Map[String, String](
            ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            ConsumerConfig.GROUP_ID_CONFIG -> "StreamTest")

        //4. Read data from Kafka
        val kafkaDStream: InputDStream[(String, String)] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
            ssc,
            kafkaPara,
            Set("test"))

        //5. Compute the WordCount
        kafkaDStream.map((_: (String, String))._2)
            .flatMap((_: String).split(" "))
            .map(((_: String), 1))
            .reduceByKey((_: Int) + (_: Int))
            .print()

        //6. Start the job
        ssc.start()
        ssc.awaitTermination()
    }
}
- Code (offsets maintained manually):
import kafka.common.TopicAndPartition
import kafka.message.MessageAndMetadata
import kafka.serializer.StringDecoder
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka.{HasOffsetRanges, KafkaUtils, OffsetRange}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext and set the checkpoint directory
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
        ssc.checkpoint("./checkpoint")

        //3. Define the Kafka parameters
        val kafkaPara: Map[String, String] = Map[String, String](
            ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            ConsumerConfig.GROUP_ID_CONFIG -> "StreamTest")

        //4. Fetch the offsets saved at the end of the previous run => getOffset(MySQL)
        val fromOffsets: Map[TopicAndPartition, Long] = Map[TopicAndPartition, Long](TopicAndPartition("test", 0) -> 20)

        //5. Read data from Kafka and create a DStream
        val kafkaDStream: InputDStream[String] = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder, String](
            ssc,
            kafkaPara,
            fromOffsets,
            (m: MessageAndMetadata[String, String]) => m.message())

        //6. Grab the offset information of the batch being consumed and keep it in an array
        // Note:
        // transform turns each batch of the DStream into an RDD; casting that RDD to HasOffsetRanges with asInstanceOf exposes its offsetRanges
        // transform is a transformation, so if no action is applied to its result, nothing inside it is ever executed
        var offsetRanges: Array[OffsetRange] = Array.empty
        kafkaDStream.transform {
            rdd: RDD[String] => {
                offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
                for (offset <- offsetRanges) {
                    println(s"${offset.topic}:${offset.partition}:${offset.fromOffset}:${offset.untilOffset}")
                }
                rdd.flatMap((_: String).split(" ")).map(((_: String), 1)).reduceByKey((_: Int) + (_: Int))
            }
        }

        //7. Print the offset information
        kafkaDStream.foreachRDD {
            rdd: RDD[String] => {
                for (offset <- rdd.asInstanceOf[HasOffsetRanges].offsetRanges) {
                    println(s"${offset.topic}:${offset.partition}:${offset.fromOffset}:${offset.untilOffset}")
                }
            }
        }

        //8. Start the job
        ssc.start()
        ssc.awaitTermination()
    }
}
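The example above only prints the offsets; to truly maintain them manually, the end offset of every partition has to be written back to external storage after each batch. Below is a minimal sketch of that "save" side, assuming a MySQL instance reachable at bigdata01:3306 with a table kafka_offsets(topic, partition, offset); the connection string, credentials, and table layout are placeholders of my own, not part of the original setup:
import java.sql.{Connection, DriverManager, PreparedStatement}

import org.apache.spark.streaming.kafka.OffsetRange

object OffsetStoreSketch {

    //Upsert the end offset of every partition of the batch into MySQL (placeholder URL, credentials, and table)
    def saveOffsets(ranges: Array[OffsetRange]): Unit = {
        val conn: Connection = DriverManager.getConnection("jdbc:mysql://bigdata01:3306/spark", "user", "password")
        val stmt: PreparedStatement = conn.prepareStatement(
            "REPLACE INTO kafka_offsets(topic, `partition`, `offset`) VALUES (?, ?, ?)")
        try {
            for (range <- ranges) {
                stmt.setString(1, range.topic)
                stmt.setInt(2, range.partition)
                stmt.setLong(3, range.untilOffset)
                stmt.executeUpdate()
            }
        } finally {
            stmt.close()
            conn.close()
        }
    }
}
Calling OffsetStoreSketch.saveOffsets(rdd.asInstanceOf[HasOffsetRanges].offsetRanges) at the end of the foreachRDD block in step 7, once the batch has been processed, would then replace the println with a durable store that step 4 can read from on the next start.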
3.4. Kafka 0-10 Direct Mode (the mode to use in 3.x versions)
- Requirement: read data from Kafka with Spark Streaming, perform a simple computation on it, and print the result to the console.
- Add the dependencies:
<dependency>
    <groupId>org.apache.spark</groupId>
    <artifactId>spark-streaming-kafka-0-10_2.12</artifactId>
    <version>${spark.version}</version>
    <scope>provided</scope>
</dependency>
<dependency>
    <groupId>com.fasterxml.jackson.core</groupId>
    <artifactId>jackson-core</artifactId>
    <version>2.10.1</version>
    <scope>provided</scope>
</dependency>
- Code:
import org.apache.kafka.clients.consumer.{ConsumerConfig, ConsumerRecord}
import org.apache.spark.SparkConf
import org.apache.spark.streaming.dstream.{DStream, InputDStream}
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamTest {

    def main(args: Array[String]): Unit = {

        //1. Initialize the Spark configuration
        val sparkConf: SparkConf = new SparkConf().setMaster("local[*]").setAppName("StreamTest")

        //2. Initialize the StreamingContext and set the checkpoint directory
        val ssc: StreamingContext = new StreamingContext(sparkConf, Seconds(3))
        ssc.checkpoint("./checkpoint")

        //3. Define the Kafka parameters
        val kafkaPara: Map[String, Object] = Map[String, Object](
            ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG -> "bigdata01:9092,bigdata02:9092,bigdata03:9092",
            ConsumerConfig.GROUP_ID_CONFIG -> "StreamTest",
            ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer",
            ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG -> "org.apache.kafka.common.serialization.StringDeserializer")

        //4. Read data from Kafka and create a DStream
        val kafkaDStream: InputDStream[ConsumerRecord[String, String]] = KafkaUtils.createDirectStream[String, String](
            ssc,
            LocationStrategies.PreferConsistent,
            ConsumerStrategies.Subscribe[String, String](Set("testTopic"), kafkaPara))

        //5. Extract the value of every record
        val valueDStream: DStream[String] = kafkaDStream.map((record: ConsumerRecord[String, String]) => record.value())

        //6. Compute the WordCount
        valueDStream.flatMap(_.split(" "))
            .map((_, 1))
            .reduceByKey(_ + _)
            .print()

        //7. Start the job
        ssc.start()
        ssc.awaitTermination()
    }
}
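A possible extension of the 0-10 example (my own sketch, not part of the original code): the 0-10 integration lets the stream itself commit the processed offsets back to Kafka through CanCommitOffsets, which is typically combined with enable.auto.commit set to false so that offsets only move forward after a batch has been processed. The fragment below is written against the kafkaDStream built above:
import org.apache.spark.streaming.kafka010.{CanCommitOffsets, HasOffsetRanges, OffsetRange}

kafkaDStream.foreachRDD { rdd =>
    //Grab the offset ranges of this batch before any shuffle changes the partitioning
    val offsetRanges: Array[OffsetRange] = rdd.asInstanceOf[HasOffsetRanges].offsetRanges

    //... process the batch here ...

    //Asynchronously commit the offsets of the processed batch back to Kafka
    kafkaDStream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
}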
- You can check the consumption progress of the consumer group on the corresponding Kafka topic from the command line:
bin/kafka-consumer-groups.sh --describe --bootstrap-server bigdata01:9092 --group StreamTest
Note: links to the other articles in this Spark series can be found here -> Spark Article Index