Interview question: Can cache be followed by other operators, and is it an action operation?
Yes, it can be followed by other operators, and no, it is not an action operator: cache() returns the RDD itself and does not trigger a job.
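A quick illustration (a sketch, assuming an existing SparkContext sc and a hypothetical input file input.txt):

val words = sc.textFile("input.txt").flatMap(_.split(" "))   // hypothetical path

// cache() just marks the RDD and returns it, so other operators can be chained right after it.
val counts = words.cache().map((_, 1)).reduceByKey(_ + _)

// Nothing has been computed or cached yet; only an action such as collect() triggers a job,
// and that job is what actually fills the cache.
counts.collect()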
Source code walkthrough
After cache or persist is called on an RDD, the RDD is assigned a storage level, but that level is only recorded in a member variable (storageLevel); the RDD is not actually cached at this point. The data is only cached when the RDD is actually computed.
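A simplified sketch of what this means (not the verbatim Spark source, but it captures the behavior: persist only records the level and returns the same RDD):

// Inside RDD[T] (simplified sketch):
def cache(): this.type = persist(StorageLevel.MEMORY_ONLY)

def persist(newLevel: StorageLevel): this.type = {
  storageLevel = newLevel   // just remember the level; nothing is computed or written anywhere
  this                      // returning this is why cache/persist can be chained with other operators
}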
Take HadoopRDD as an example (the compute() below is from a helper RDD defined in HadoopRDD.scala); compute() pulls the parent RDD's data through iterator():
override def compute(split: Partition, context: TaskContext): Iterator[U] = {
  val partition = split.asInstanceOf[HadoopPartition]
  val inputSplit = partition.inputSplit.value
  f(inputSplit, firstParent[T].iterator(split, context))
}
The iterator() method it calls is defined in the RDD base class:
/**
 * Internal method to this RDD; will read from cache if applicable, or otherwise compute it.
 * This should ''not'' be called by users directly, but is available for implementors of custom
 * subclasses of RDD.
 */
final def iterator(split: Partition, context: TaskContext): Iterator[T] = {
  if (storageLevel != StorageLevel.NONE) {
    getOrCompute(split, context)
  } else {
    computeOrReadCheckpoint(split, context)
  }
}
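The branch taken here depends only on storageLevel, which is exactly the member variable that cache()/persist() set. This can be observed from user code via getStorageLevel (a sketch, assuming an existing SparkContext sc):

val rdd = sc.parallelize(1 to 100)
println(rdd.getStorageLevel)   // NONE: iterator() would go through computeOrReadCheckpoint
rdd.cache()
println(rdd.getStorageLevel)   // MEMORY_ONLY: iterator() now goes through getOrCompute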
Next, look at the getOrCompute method. The blockId it builds follows the rule below, which shows that each block corresponds to exactly one partition:
@DeveloperApi
case class RDDBlockId(rddId: Int, splitIndex: Int) extends BlockId {
  override def name: String = "rdd_" + rddId + "_" + splitIndex
}
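For example, under this rule partition 3 of the RDD whose id is 42 is stored as exactly one block (an illustration of the class above; the ids are made up):

import org.apache.spark.storage.RDDBlockId

val blockId = RDDBlockId(rddId = 42, splitIndex = 3)
assert(blockId.name == "rdd_42_3")   // one block per (rddId, partition) pair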
Inside getOrCompute, which runs on the executor side, SparkEnv.get.blockManager.getOrElseUpdate() is called:
/**
 * Gets or computes an RDD partition. Used by RDD.iterator() when an RDD is cached.
 */
private[spark] def getOrCompute(partition: Partition, context: TaskContext): Iterator[T] = {
  val blockId = RDDBlockId(id, partition.index)
  var readCachedBlock = true
  // This method is called on executors, so we need call SparkEnv.get instead of sc.env.
  SparkEnv.get.blockManager.getOrElseUpdate(blockId, storageLevel, elementClassTag, () => {
    readCachedBlock = false
    computeOrReadCheckpoint(partition, context)
  }) match {
    case Left(blockResult) =>
      if (readCachedBlock) {
        val existingMetrics = context.taskMetrics().inputMetrics
        existingMetrics.incBytesRead(blockResult.bytes)
        new InterruptibleIterator[T](context, blockResult.data.asInstanceOf[Iterator[T]]) {
          override def next(): T = {
            existingMetrics.incRecordsRead(1)
            delegate.next()
          }
        }
      } else {
        new InterruptibleIterator(context, blockResult.data.asInstanceOf[Iterator[T]])
      }
    case Right(iter) =>
      new InterruptibleIterator(context, iter.asInstanceOf[Iterator[T]])
  }
}
Finally, look at the getOrElseUpdate method in BlockManager, which is what actually caches the data:
/**
 * Retrieve the given block if it exists, otherwise call the provided `makeIterator` method
 * to compute the block, persist it, and return its values.
 *
 * @return either a BlockResult if the block was successfully cached, or an iterator if the block
 *         could not be cached.
 */
def getOrElseUpdate[T](
    blockId: BlockId,
    level: StorageLevel,
    classTag: ClassTag[T],
    makeIterator: () => Iterator[T]): Either[BlockResult, Iterator[T]] = {
  // Attempt to read the block from local or remote storage. If it's present, then we don't need
  // to go through the local-get-or-put path.
  get[T](blockId)(classTag) match {
    case Some(block) =>
      return Left(block)
    case _ =>
      // Need to compute the block.
  }
  // Initially we hold no locks on this block.
  doPutIterator(blockId, makeIterator, level, classTag, keepReadLock = true) match {
    case None =>
      // doPut() didn't hand work back to us, so the block already existed or was successfully
      // stored. Therefore, we now hold a read lock on the block.
      val blockResult = getLocalValues(blockId).getOrElse {
        // Since we held a read lock between the doPut() and get() calls, the block should not
        // have been evicted, so get() not returning the block indicates some internal error.
        releaseLock(blockId)
        throw new SparkException(s"get() failed for block $blockId even though we held a lock")
      }
      // We already hold a read lock on the block from the doPut() call and getLocalValues()
      // acquires the lock again, so we need to call releaseLock() here so that the net number
      // of lock acquisitions is 1 (since the caller will only call release() once).
      releaseLock(blockId)
      Left(blockResult)
    case Some(iter) =>
      // The put failed, likely because the data was too large to fit in memory and could not be
      // dropped to disk. Therefore, we need to pass the input iterator back to the caller so
      // that they can decide what to do with the values (e.g. process them without caching).
      Right(iter)
  }
}
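Putting the pieces together, a minimal end-to-end sketch (assuming a local SparkContext and a hypothetical input file data.txt): the first action computes each partition and writes it into the block store via getOrElseUpdate; the second action reads the cached blocks back through getOrCompute.

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.storage.StorageLevel

object CacheDemo {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("cache-demo").setMaster("local[2]"))

    val words = sc.textFile("data.txt")        // hypothetical input path
      .flatMap(_.split(" "))
      .map((_, 1))

    words.persist(StorageLevel.MEMORY_ONLY)    // only records the level; nothing is cached yet

    // First action: each partition is computed and written to the block store
    // under "rdd_<id>_<partition>" via BlockManager.getOrElseUpdate.
    println(words.count())

    // Second action: iterator() sees storageLevel != NONE, so getOrCompute reads the cached
    // blocks instead of re-reading and re-splitting the input file.
    println(words.reduceByKey(_ + _).count())

    sc.stop()
  }
}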