Tuesday, March 10, 2015

Spark Streaming + Kafka: Could not compute split, block input ... not found



After digging around on Google and Stack Overflow, I haven't found a good resource on this one yet.


I'm running Spark Streaming 1.2 with the Kafka consumer in Java. When I run larger input tests, about half of my stages fail with:



Job aborted due to stage failure: Task 0 in stage 505.0 failed 4 times, most recent failure: Lost task 0.3 in stage 505.0 (TID 9953, ec2-54-149-87-18.us-west-2.compute.amazonaws.com): java.lang.Exception: Could not compute split, block input-0-1426033579400 not found
at org.apache.spark.rdd.BlockRDD.compute(BlockRDD.scala:51)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:263)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:228)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)


The pipeline is a JavaPairReceiverInputDStream from Kafka, a flatMapToPair call that converts the Kafka data into a JavaPairDStream, then a groupByKeyAndWindow to build windowed batches, and finally a foreachRDD call that writes each batch out to a datastore.


All of the DStreams are set to .persist(StorageLevel.MEMORY_AND_DISK()).
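For reference, here is a minimal sketch of that pipeline. The batch interval, window/slide durations, topic name, ZooKeeper address, pass-through parsing, and the count() standing in for the datastore write are all placeholders, not my real values:

import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.SparkConf;
import org.apache.spark.api.java.JavaPairRDD;
import org.apache.spark.api.java.function.Function;
import org.apache.spark.api.java.function.PairFlatMapFunction;
import org.apache.spark.storage.StorageLevel;
import org.apache.spark.streaming.Duration;
import org.apache.spark.streaming.api.java.JavaPairDStream;
import org.apache.spark.streaming.api.java.JavaPairReceiverInputDStream;
import org.apache.spark.streaming.api.java.JavaStreamingContext;
import org.apache.spark.streaming.kafka.KafkaUtils;

import scala.Tuple2;

public class KafkaWindowPipelineSketch {
  public static void main(String[] args) throws Exception {
    SparkConf conf = new SparkConf().setAppName("kafka-window-sketch");
    // 10-second batch interval (placeholder)
    JavaStreamingContext jssc = new JavaStreamingContext(conf, new Duration(10 * 1000));

    // Receiver-based Kafka consumer; topic, ZooKeeper quorum, and group id are placeholders
    Map<String, Integer> topics = new HashMap<String, Integer>();
    topics.put("events", 1);
    JavaPairReceiverInputDStream<String, String> kafka =
        KafkaUtils.createStream(jssc, "zk-host:2181", "consumer-group", topics);

    // Convert each Kafka record into zero or more key/value pairs
    JavaPairDStream<String, String> pairs = kafka.flatMapToPair(
        new PairFlatMapFunction<Tuple2<String, String>, String, String>() {
          @Override
          public Iterable<Tuple2<String, String>> call(Tuple2<String, String> record) {
            // stand-in for the real parsing logic: pass the record through unchanged
            return Collections.singletonList(
                new Tuple2<String, String>(record._1(), record._2()));
          }
        });
    pairs.persist(StorageLevel.MEMORY_AND_DISK());

    // Group into windowed batches; 60s window / 10s slide are placeholders
    JavaPairDStream<String, Iterable<String>> windowed =
        pairs.groupByKeyAndWindow(new Duration(60 * 1000), new Duration(10 * 1000));
    windowed.persist(StorageLevel.MEMORY_AND_DISK());

    // Write each windowed batch out; the datastore write is stubbed with a count here
    windowed.foreachRDD(new Function<JavaPairRDD<String, Iterable<String>>, Void>() {
      @Override
      public Void call(JavaPairRDD<String, Iterable<String>> rdd) {
        System.out.println("window batch size: " + rdd.count());
        return null;
      }
    });

    jssc.start();
    jssc.awaitTermination();
  }
}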


The failures occur exclusively in the flatMapToPair stages. For these, the failed stages have wait times ranging from 1 to 9 minutes, in a very (suspiciously?) flat distribution.


I can't find anything about how this block data might be getting cleared. Am I missing something in this use case? I could blindly tweak things to try to keep all stages running quickly, but if this is a result of delayed stages, I want to understand the mechanism that is causing it.


This question covers the same error, but in that case the publisher wasn't lagging behind, so the respondents left that scenario out of the answer. My system is definitely getting backed up, so I think this question occupies a separate space: Spark Streaming: Could not compute split, block not found



