Nowadays everyone is talking about Big Data. So what exactly is Big Data? It is a huge volume of data, mostly unstructured, generated at high velocity. The data can be of any kind: messages shared on social media, data on climate change, and photos and videos, to name a few. Analyzing this data can help companies target a new customer base or improve the experience of existing customers. However, the sheer volume of the data makes it very difficult to process with conventional frameworks.
What is Storm?
In this first part of the series, we will introduce ourselves to Storm. In subsequent parts, we will delve into deploying Storm and understand it better with the help of working examples.
Storm is a distributed real-time computation framework developed by BackType, which was later acquired by Twitter. It provides a framework to process unbounded streams of data in real time. There are other frameworks, most notably Hadoop, that process big data. Essentially, Hadoop and Storm are very similar, with the difference that Hadoop processes data in batches while Storm does it in real time.
Storm processes data by clustering multiple machines (nodes) and computing each individual unit of data, called a tuple. It distributes these tuples to different nodes and monitors the processing of each tuple, making sure that every tuple is processed even if a node fails. If a node fails while processing a tuple, the tuple is passed on to the next available node. These features make Storm extremely scalable and fault tolerant.
Storm is designed to run continuously to process a continuous flow of data. This makes it essential that the clusters created by Storm be stable and resilient to failure.
Storm clusters consist of two types of nodes: a master node and worker nodes.
The master node runs a daemon called ‘Nimbus’. Nimbus distributes tasks among the worker nodes and monitors them for completion or failure.
Every worker node runs a daemon called ‘Supervisor’. The Supervisor listens for tasks assigned to its node and starts the worker processes.
The coordination between Nimbus and the Supervisors is managed by a ZooKeeper cluster, which holds the state of the tasks. This lets the Nimbus and Supervisor daemons be stateless: if a Nimbus or Supervisor process gets killed during its lifecycle, it can be brought back up and carry on with its tasks, since the state lives in ZooKeeper. The daemons are designed to be run under process supervision (with a tool such as daemontools or supervisord), which makes the clusters fail-fast and stable.
In Storm, everything starts from a Topology. A Topology defines the components that need to be present in a cluster, along with the number of instances of each component. More importantly, the topology defines how the components are linked together in the cluster. Topologies can be run in local mode or submitter mode. We use local mode for development. In production, however, the topology needs to be submitted to Nimbus in submitter mode, which brings all the components of the framework into the picture and gives the Storm cluster its robustness and scalability.
In Storm, the data consumed is called a ‘Stream’. Each individual unit of data within a stream is called a ‘tuple’, which makes a Stream an unbounded sequence of tuples. Each component that processes a tuple transforms the data before passing it on to the next component. The components involved in a Topology are ‘Spouts’ and ‘Bolts’.
A Spout is a source of streams: it fetches the data that needs to be processed and forwards it to Bolts. Bolts are components that accept a single tuple, operate on it, and can then emit to other bolts, depending on the chaining declared in the topology.
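To make the spout-and-bolt idea concrete before we touch the Storm API, here is a plain-Java sketch (the class and method names are ours, not Storm's) of a "spout" producing a stream of tuples and a "bolt" transforming each one before passing it on:

```java
import java.util.ArrayList;
import java.util.List;

// A plain-Java sketch of the dataflow, NOT the Storm API:
// the "spout" produces tuples, the "bolt" transforms each tuple
// before it moves down the chain.
public class TupleFlowSketch {

    // The spout: a source of tuples (here, just some sample words).
    static List<String> spout() {
        return List.of("storm", "hadoop", "zookeeper");
    }

    // The bolt: accepts one tuple, transforms it, and emits the result.
    static String bolt(String tuple) {
        return tuple.toUpperCase();
    }

    public static void main(String[] args) {
        List<String> emitted = new ArrayList<>();
        // The spout forwards each tuple to the bolt, as declared by the chaining.
        for (String tuple : spout()) {
            emitted.add(bolt(tuple));
        }
        System.out.println(emitted); // [STORM, HADOOP, ZOOKEEPER]
    }
}
```

In real Storm, the spout and bolt instances run on different nodes and the "loop" is the cluster routing tuples between them, but the shape of the computation is the same.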
A deeper look into the components
Let us look at a Topology and how to declare one.
TopologyBuilder builder = new TopologyBuilder();
builder.setSpout("SpoutA", new MySpoutA(), 3);
builder.setBolt("BoltA", new MyBoltA(), 5).shuffleGrouping("SpoutA");
builder.setBolt("BoltB", new MyBoltB(), 10).shuffleGrouping("SpoutA");
builder.setBolt("BoltC", new MyBoltC(), 10).shuffleGrouping("BoltB");
This is how a typical Topology is declared. We use a ‘TopologyBuilder’ object to declare a topology. In the above example, the topology consists of one Spout and three Bolts.
Here, BoltA and BoltB are chained after SpoutA, and BoltC is chained after BoltB. Any tuple emitted from SpoutA will be received by both BoltA and BoltB. Similarly, tuples emitted from BoltB will be received by BoltC. While declaring a Spout or a Bolt, the number of parallel instances can also be declared. In the above example, we have three instances of SpoutA, five instances of BoltA, and ten instances each of BoltB and BoltC.
A topology then needs to be submitted. There are two ways of submitting a topology: local mode or submitter mode.
Local mode is used during development, when we do not want to install all the other components just to run Storm.
Config conf = new Config();
LocalCluster cluster = new LocalCluster();
cluster.submitTopology("TopologyA", conf, builder.createTopology());
Utils.sleep(10000);   // let the topology run for a while
cluster.shutdown();   // tear down the in-process cluster
For deployment, we need to bring all the components, like ZooKeeper, Nimbus, and the Supervisors, into the picture. For that, the topology needs to be submitted in submitter mode, as follows.
Config conf = new Config();
StormSubmitter.submitTopology("TopologyA", conf, builder.createTopology());
Spout and Bolt Implementation
All Spouts developed in Storm implement the interface ‘IRichSpout’ and all Bolts implement ‘IRichBolt’. These interfaces have methods that start and stop the component, declare the fields it emits, process a tuple, and acknowledge it. ‘BaseRichSpout’ and ‘BaseRichBolt’ are out-of-the-box implementations of ‘IRichSpout’ and ‘IRichBolt’ respectively, which provide default implementations of the basic methods, such as starting and stopping the component and acknowledging tuples.
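As a sketch of what such a Bolt looks like, here is a minimal class extending ‘BaseRichBolt’ that uppercases the first field of every incoming tuple. The class name and the emitted field name are our own; the package names assume the original BackType-era distribution (`backtype.storm`), so to compile it you need the Storm jar on the classpath:

```java
import java.util.Map;

import backtype.storm.task.OutputCollector;
import backtype.storm.task.TopologyContext;
import backtype.storm.topology.OutputFieldsDeclarer;
import backtype.storm.topology.base.BaseRichBolt;
import backtype.storm.tuple.Fields;
import backtype.storm.tuple.Tuple;
import backtype.storm.tuple.Values;

// A minimal example bolt: uppercases the first field of each tuple.
public class UpperCaseBolt extends BaseRichBolt {
    private OutputCollector collector;

    // The transformation this bolt applies, kept separate for clarity.
    static String transform(String word) {
        return word.toUpperCase();
    }

    // Called once when this bolt instance starts on a worker.
    @Override
    public void prepare(Map conf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    // Called for every incoming tuple.
    @Override
    public void execute(Tuple tuple) {
        String word = tuple.getString(0);
        collector.emit(tuple, new Values(transform(word)));
        collector.ack(tuple);  // tell Storm this tuple has been fully processed
    }

    // Declares the fields this bolt emits to downstream bolts.
    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("upper-word"));
    }
}
```

The call to `collector.ack()` is what feeds Storm's guarantee that every tuple is processed: until a tuple is acknowledged by every bolt in its chain, Storm considers it pending and can replay it.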
Storm is a powerful framework that fits specific requirements where large amounts of continuous data need to be processed. Even though it is at a nascent stage, Storm is evolving to solve the problem of the ever-growing volume of data that needs to be processed.
Now that we understand what Storm is, in the next part of the series we will see how to write a Storm topology and deploy it in both local and production modes.