High-availability in scale-out stream processing

Presenter: Marios Fragkoulis
Date: 17 December 2018

Abstract

Data stream processing offers low latency processing of bounded and unbounded data sets with strict semantics over the time dimension of data and consistent fault-tolerant operation. The applicability of data stream processing and the maturity of current stream processing systems (SPS) has produced numerous and large scale deployments around the world for commercial and other important use cases, such as fraud detection, risk analysis, and disaster prediction.

When processing unbounded data for commercial or critical purposes the capability to recover from failures quickly, if not transparently, is important. In this work in progress we study three different recovery configurations: - restart recovery, which restarts a stream processing job in case of an operator failure from the latest checkpointed state - standby task recovery, which substitutes a failed operator with a standby instance - process pair, which runs two coordinated instances of a stream processing job in order to switch from one to the other in case of failure. Through empirical experiments we plan to map the tradeoffs between availability and resource utilization between the recovery configurations.