Robust Workflow Management for Big Data Workloads


Big Data refers to large-scale structured, semi-structured, or unstructured data sets whose manipulation and management present significant logistical challenges. Most Big Data platforms (Hadoop, NoSQL databases, and the like) run across many commodity servers, collectively referred to as a cluster. Organizations generally have one production cluster and multiple smaller clusters used by test engineers, developers, and data scientists, and they often move hundreds of terabytes of data, if not more, between these clusters.


Organizations typically follow a “do it yourself” model, whereby individual end users implement their own workflows for moving data. In most cases, data must be protected before being moved across clusters. Two examples: (a) personally identifiable information (PII) in production data must be masked before being made available to test/dev personnel; and (b) a sample of the production data must be moved to a different cluster. Workflows like these become increasingly complex to develop and maintain over time. Delays incurred while building new workflows or fixing existing ones after configuration changes result in downtime for end users, costing time and money.


As a result, developers and IT administrators must put significant effort into scripting workflows that meet the following crucial challenges of the Big Data world:


  • Restartability
  • Resilience
  • Flexible policies — event-driven or time-based
  • Chained sub-workflows
  • Maintainability
  • Fair-share system resources




Restartability


A typical Big Data workflow consists of several discrete actions, including taking snapshots across multiple production nodes, copying data, indexing, masking sensitive data, and more. Each step is fairly lengthy and resource-intensive. Hence, the workflow manager must be able to checkpoint each step and continue from that point if the workflow fails, for example because of a failure in underlying hardware such as a physical node or the network.




Resilience


Most workflows must be programmed with a significant amount of configuration information, such as data objects (in large numbers), IP addresses of nodes in the cluster, the frequency of the workflow, trigger points, and similar details. Since Big Data platforms have a scale-out architecture and run on commodity hardware, each component in the ecosystem must be able to survive multiple node failures. A workflow manager is no exception. Hence, typical scheduling engines with a single point of failure are not suitable in their raw form to deliver the expected level of resilience.
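A toy sketch of the replication idea follows, with in-memory dictionaries standing in for nodes; a production system would use a replicated log or a consensus service instead. The class and key names are illustrative, not Imanis Data's.

```python
import json

class ReplicatedConfigStore:
    """Workflow configuration written to every node, readable from any survivor."""

    def __init__(self, num_nodes):
        # Each "node" is modeled as a dict; None marks a failed node.
        self.nodes = [dict() for _ in range(num_nodes)]

    def put(self, key, value):
        for node in self.nodes:
            if node is not None:                # skip nodes that are down
                node[key] = json.dumps(value)

    def fail_node(self, idx):
        self.nodes[idx] = None                  # simulate a hardware failure

    def get(self, key):
        for node in self.nodes:
            if node is not None and key in node:
                return json.loads(node[key])    # any surviving replica will do
        raise KeyError(key)
```

With three replicas, the configuration survives the loss of any two nodes.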


Flexible policies


Unlike traditional data management or data copy workflows, which are typically characterized by daily or weekly full-copy policies, Big Data workflows need much more flexible scheduling. Production clusters are used 24×7, and the only reasonable time to schedule a backup or a data copy to another cluster may be after a data load operation or a batch process completes. So in addition to workflow control through regular scheduling options, specific events must be able to trigger workflows.
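One way to model such a dual policy is sketched below, with made-up names (the event name "data_load_complete" and the interval semantics are assumptions, not Imanis Data's API): a workflow fires either when its time interval has elapsed or when a subscribed event arrives.

```python
class TriggerPolicy:
    """Fire a workflow on a time interval or on a named event, whichever comes first."""

    def __init__(self, interval_seconds=None, events=()):
        self.interval = interval_seconds
        self.events = set(events)
        self.last_run = None

    def should_run(self, now, event=None):
        if event is not None and event in self.events:
            return True                          # event-driven trigger
        if self.interval is not None:
            if self.last_run is None or now - self.last_run >= self.interval:
                return True                      # time-based trigger
        return False

    def mark_run(self, now):
        self.last_run = now
```

A policy with a one-hour interval still fires immediately when, say, a batch load finishes, without waiting for the next scheduled slot.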


Chained sub-workflows


Most workflows in the Big Data world comprise multiple independent sub-workflows. Each sub-workflow can be a fairly complex directed acyclic graph (DAG) and may have to be scheduled at a different frequency than the other workflows in the chain. To further complicate matters, there is often a many-to-many mapping between the outputs of sub-workflows in the chain and the peers that consume those outputs later in the chain.
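The many-to-many dependencies between sub-workflows can themselves be expressed as a DAG. The sketch below uses Python's standard graphlib with illustrative step names (the chain shown is an example, not a prescribed pipeline): each entry maps a sub-workflow to the set of sub-workflows whose outputs it consumes, and a topological sort yields a valid execution order.

```python
from graphlib import TopologicalSorter

# Each key depends on the sub-workflows in its value set (illustrative names).
chain = {
    "index": {"snapshot"},       # indexing consumes the snapshot
    "mask":  {"snapshot"},       # masking also consumes the snapshot
    "copy":  {"index", "mask"},  # the copy step needs both outputs
}

# A valid order in which the sub-workflows can run.
order = list(TopologicalSorter(chain).static_order())
```

"snapshot" always comes first and "copy" last; "index" and "mask" are independent of each other and could run concurrently.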




Maintainability


Workflows must be updated regularly for changes in configuration details, or amended to handle changes in the target data set. For example, consider a typical Hive table with several dynamically created partitions. Users who need newly created data objects to be covered automatically by their workflows have to specify nested regular expressions. Users deserve a simple, quick way to edit workflow configurations whenever needed.
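As a sketch of what such a pattern might look like, the example below matches dynamically created Hive partition paths against a regular expression so that new partitions are picked up without editing the workflow. The table, partition scheme, and path layout are all made up for illustration.

```python
import re

# Match date-partitioned paths of a hypothetical "sales" table,
# with an optional nested "region" sub-partition.
pattern = re.compile(r"^sales/dt=\d{4}-\d{2}-\d{2}(/region=[a-z]+)?$")

partitions = [
    "sales/dt=2018-06-01",
    "sales/dt=2018-06-02/region=emea",
    "sales_staging/dt=2018-06-02",   # different table: excluded
]

# Newly created partitions matching the pattern are covered automatically.
covered = [p for p in partitions if pattern.match(p)]
```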


Fair-share system resources


All Big Data implementations have a scale-out architecture and run on top of several independent hardware servers or virtual machines (both referred to as nodes). For optimum performance, it is crucial that system resources be shared fairly across all the nodes. In addition, workflow priorities must be honored during resource allocation so that high-priority workflows are favored over those with lower priority.
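A minimal sketch of priority-weighted fair sharing follows. The semantics are an assumption for illustration, not Imanis Data's algorithm: a pool of task slots is divided in proportion to each workflow's priority weight, with any leftover slots going to the highest-priority workflows first.

```python
def fair_share(total_slots, priorities):
    """Divide total_slots across workflows in proportion to priority weights.

    priorities: {workflow_name: weight}; returns {workflow_name: slots}.
    """
    total_weight = sum(priorities.values())
    # Integer proportional share for each workflow.
    shares = {w: (total_slots * p) // total_weight for w, p in priorities.items()}
    # Hand leftover slots to the highest-priority workflows first.
    leftover = total_slots - sum(shares.values())
    for w in sorted(priorities, key=priorities.get, reverse=True):
        if leftover == 0:
            break
        shares[w] += 1
        leftover -= 1
    return shares
```

For example, ten slots split across weights 3:1:1 give the heaviest workflow six slots and the others two each, and every slot is always assigned.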


Imanis Data has developed a truly web-scale Workflow Manager that addresses the above requirements. Our intuitive graphical user interface enables users to configure complex workflows with just a few clicks, without any scripting. Workflow configurations remain available even after multiple node failures. Users can specify workflow priorities, and the Workflow Manager dynamically assigns resources from the various nodes to each workflow accordingly. We also provide representational state transfer (REST) APIs so that users can integrate Imanis Data workflows with those on the production cluster and trigger Imanis Data workflows from specific events.
