The following guest post is from Dr. Phil Shelley. As former CTO of Sears Holdings, Dr. Shelley has several years of experience helping move a large, iconic brand to a near real-time model in which digital engagement with customers has become the norm.
The approach to data management is changing as our use of big data solutions evolves. Historically, data warehouses were loaded with partial data from transactional systems, were frequently archived, and contained modest data volumes. In the event of system or procedural issues, a legacy data warehouse could be reloaded from its own backups, from the source transactional systems, or from transactional backup archives.
The enterprise data hub (EDH) running on Hadoop is a data management and analytics solution that is rapidly gathering pace and displacing legacy data warehouses and even mainframes. An important concept of an EDH is to load data in near real time and to retain it at full fidelity and in full detail, preferably forever. This “no ETL” concept is a primary feature of an EDH and is what makes it so flexible, cost-effective and powerful, especially for analytics.
Over time, however, as more work moves to an EDH, data backup, recovery and high availability become growing requirements. Restoring full-fidelity data from transactional systems may no longer be possible, or may simply be impractical. As a result, new paradigms in data management practices are rapidly emerging. These include dual ingestion, active-active Hadoop, distributed file systems, orchestration of data movement between systems, backup and copy across multiple Hadoop clusters and, more recently, more advanced solutions for backup snapshots, rollbacks, recovery and restoration of time-based changes.
In my experience as CTO at Sears, and since then in helping other large firms implement these solutions, I have seen a pattern emerge in which a single Hadoop cluster rapidly grows in size and importance to company operations. Typically within two years, IT leaders realize, as I did, that a more comprehensive data management strategy is essential. There are now good solutions that bring enterprise-class resilience to the EDH concept.
It will be interesting to see how these new technologies and approaches mature to provide highly available and redundant enterprise data solutions using big data technologies, while retaining the advantages of the EDH. Today we can benefit from the big data EDH approach, with its almost unlimited low-cost compute, storage and analytics versatility and with data loaded in near real time, while increasingly gaining the data protection advantages of legacy systems.