As enterprises deploy big data platforms like NoSQL databases, Hadoop, and enterprise data warehouses in greater numbers, it is clear that certain data management architectures limit a company’s ability to protect and manage petabyte-scale data sets. Effective backup, recovery, test/dev management, and archiving for big data sources require a different set of architectural principles than traditional data sources like Oracle or SQL Server. Here are four pitfalls, some of them counter-intuitive, to avoid when building a data management architecture:
- Agents on your primary cluster. Companies face three big issues when deploying agents. First, big data clusters run on commodity hardware and storage and are characterized by constant commissioning and decommissioning of nodes; the IT staff has to monitor for failures of both the nodes and the agents themselves, creating extra operational overhead. Second, agents consume resources and add latency to your production environment. Third, agents must fit into the company’s security posture, creating another layer of potential vulnerability.
- Separation of storage from compute. The primary benefits of integrating storage and compute are minimizing data movement and maximizing storage savings. Big data applications built on NoSQL and Hadoop infrastructures typically require direct-attached storage rather than network storage such as SAN and NAS devices. Petabyte-scale data environments without integrated compute and storage incur significant bandwidth costs, unnecessary storage expenses, and slower workflow completion times.
- Inability to scale with your primary cluster. Plan for success. If your production nodes double in number and size, your data management platform should scale linearly with your production cluster to ensure predictable performance, effective management of larger amounts of data, and continued storage optimization.
- Lack of data masking and sampling. To expedite application development, companies often want actual production data made available in test and dev environments, typically in self-service mode. However, implementing test/dev management workflows without the means to adequately obfuscate personally identifiable information (PII) or other sensitive data puts companies at compliance risk. These dev/QA environments are also significantly smaller than the production cluster, making down-sampling a necessity. Down-sampling also provides an ancillary benefit in the form of lower bandwidth costs.
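To make the masking and down-sampling idea concrete, here is a minimal Python sketch. The field names (`user_id`, `email`) and the 10% sample fraction are hypothetical, and a production system would use purpose-built tooling (format-preserving masking, cluster-aware sampling) rather than in-memory lists; this only illustrates the two steps.

```python
import hashlib
import random

def mask_pii(record, pii_fields):
    """Replace PII fields with a deterministic one-way hash, so the same
    input always maps to the same masked value (keeps joins working in test data)."""
    masked = dict(record)
    for field in pii_fields:
        if field in masked:
            digest = hashlib.sha256(str(masked[field]).encode()).hexdigest()
            masked[field] = digest[:12]
    return masked

def down_sample(records, fraction, seed=42):
    """Take a reproducible random sample sized for a smaller test/dev cluster."""
    rng = random.Random(seed)
    return [r for r in records if rng.random() < fraction]

# Hypothetical "production" data set.
production = [
    {"user_id": i, "email": f"user{i}@example.com", "purchases": i % 5}
    for i in range(1000)
]

# Down-sample first (less data to move), then mask what remains.
test_data = [mask_pii(r, ["email"]) for r in down_sample(production, 0.10)]
```

Sampling before masking mirrors the bandwidth point above: the less data you copy out of the production cluster, the less you pay to move and store it.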
This list is by no means exhaustive. For example, it doesn’t include architecture mistakes companies make around storage optimization. However, these four areas are especially relevant when implementing the ideal data management architecture. We encourage you to read our architecture white paper to understand how we’re addressing these issues and enabling companies to take full advantage of their big data assets.