IDG Contributor Network: 3 requirements of modern archive for massive unstructured data
Perhaps the least understood component of secondary storage strategy, archive has become a necessity for modern digital enterprises with petabytes of data and billions of files.
So, what exactly is archive, and why is it so important?
Archiving data involves moving data that is no longer frequently accessed off primary systems for long-term retention.
The most apparent benefits of archiving are saving precious space on expensive primary NAS and retaining data for regulatory compliance, but archiving can also deliver long-term value for your business. For example, archiving the results of scientific experiments that would be costly to replicate can prove extremely valuable for future studies.
In addition, a strong archive tier can cost-effectively protect and enable usage of the huge data sets needed for enhanced analytics, machine learning, and artificial intelligence workflows.
Legacy archive fails for massive unstructured data
However, legacy archive infrastructure wasn’t built to meet the requirements of massive unstructured data, resulting in three key failures of legacy archive solutions.
First, the scale of data has changed greatly, from terabytes to petabytes and quickly growing. Legacy archive can’t move high volumes of data quickly enough and can’t scale with today’s exploding data sets.
Second, the way organizations use data has also changed. It’s no longer adequate to simply throw data into a vault and keep it safe; organizations need to use their archived data as digital assets become integral to business. As more organizations employ cloud computing and machine learning/AI applications using their huge repositories of data, legacy archive falls short in enabling usage of archived data.
Third, legacy data management requires too much hands-on administration. As data explodes beyond petabytes, data management must become increasingly automated and delivered as-a-Service to relieve the overhead on enterprise IT and reduce total cost of ownership.
Modern archive must overcome these failures of legacy solutions and meet the following requirements.
1. Ingest petabytes of data
Because today’s digital enterprises are generating and using petabytes of data and billions of files, a modern archive solution must have the capacity to ingest enormous amounts of data.
Legacy software uses single-threaded protocols to move data. That approach was necessary for writing to tape and worked at terabyte scale, but it fails for today's petabyte-scale data.
Modern archive needs highly parallel and latency-aware data movement to efficiently move data from where it lives to where it’s needed, without impacting performance. The ability to automatically surface archive-ready data and set policies to snapshot, move, verify, and re-export data can reduce administrator effort and streamline data management.
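As a rough illustration of what a policy-driven, parallel mover might look like, here is a minimal Python sketch; the mount points, the 180-day access threshold, and the verification step are assumptions for the example, not any vendor's actual implementation.

```python
# Minimal sketch of policy-driven, parallel archive movement.
# Mount points, the age threshold, and the verification step are
# illustrative assumptions, not any vendor's actual API.
import shutil
import time
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

PRIMARY = Path("/mnt/primary_nas")    # hypothetical primary NAS mount
ARCHIVE = Path("/mnt/archive_tier")   # hypothetical archive tier mount
ARCHIVE_AFTER_DAYS = 180              # policy: not accessed for ~6 months
MAX_WORKERS = 16                      # bound parallelism to limit impact on primary

def archive_ready(path: Path) -> bool:
    """Policy check: surface files that haven't been accessed recently."""
    age_days = (time.time() - path.stat().st_atime) / 86400
    return age_days > ARCHIVE_AFTER_DAYS

def move_and_verify(src: Path) -> None:
    """Copy to the archive tier, verify the copy, then free primary capacity."""
    dest = ARCHIVE / src.relative_to(PRIMARY)
    dest.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(src, dest)
    if dest.stat().st_size == src.stat().st_size:  # simplistic verification
        src.unlink()

candidates = [p for p in PRIMARY.rglob("*") if p.is_file() and archive_ready(p)]
with ThreadPoolExecutor(max_workers=MAX_WORKERS) as pool:
    pool.map(move_and_verify, candidates)
```

A production mover would also throttle based on observed latency on the primary system, checksum rather than compare file sizes, and leave a stub behind, but the shape is the same: surface candidates by policy, then move and verify them in parallel.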
In addition, modern archive must be able to scale with exponentially growing data. Unlike legacy archive, which necessitates silos as data grows large, a scale-out archive tier keeps data within the same system for simpler management.
2. API-driven, cloud-native architecture
An API-driven archive solution can plug into customer applications, ensuring that the data can be used. Legacy software wasn’t designed with this kind of automation, making it difficult to use the data after it’s been archived.
Modern archive that’s cloud-native can much more easily plug into customer applications and enable usage. My company’s product, Igneous Hybrid Storage Cloud, is built with event-driven computing, applying the cloud-native concept of having interoperability at every step. Event-driven computing models tie compute to actions on data and are functionally API-driven, adding agility to the software. Building in compatibility with any application is simply a matter of exposing existing APIs to customer-facing applications.
This capability is especially useful in the growing fields of machine learning and AI, where massive repositories of data are needed for compute. The more data the better, which requires not only a scale-out archive tier but one that makes the archived data available for compute.
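To make the event-driven idea concrete, here is a minimal sketch of tying compute to actions on data; the event name, payload fields, and handlers are hypothetical and are not Igneous's actual API.

```python
# Minimal sketch of event-driven compute tied to actions on data.
# Event names, payload fields, and handlers are illustrative assumptions.
from collections import defaultdict

_handlers = defaultdict(list)

def on(event):
    """Register a handler (a piece of compute) for a named data event."""
    def register(fn):
        _handlers[event].append(fn)
        return fn
    return register

def emit(event, payload):
    """Fire every handler registered for an event as the action completes."""
    for fn in _handlers[event]:
        fn(payload)

@on("object.archived")
def index_for_search(payload):
    # A customer-facing application could build a searchable catalog here.
    print(f"indexing {payload['path']} on tier {payload['tier']}")

@on("object.archived")
def queue_for_training(payload):
    # Another consumer: flag newly archived data for an ML training corpus.
    print(f"queueing {payload['path']} for the training-set builder")

# The archive engine emits events as it acts on data; compute follows.
emit("object.archived", {"path": "/projects/exp42/results.h5", "tier": "archive"})
```

In a real system the events would travel over an HTTP API or a message queue rather than in-process callbacks, but the interoperability point is the same: customer applications subscribe to actions on data without needing to know anything about the archive internals.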
An example of a machine learning/AI workflow used by Igneous customers involves using Igneous Hybrid Storage Cloud as the archive tier for petabytes of unstructured file data and moving smaller subsets of that data to a “hot edge” primary tier, where it can be processed and computed on.
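A rough sketch of that pattern follows; the mount points, the manifest, and the recall helper are placeholders for whatever mechanism actually restores files to the fast tier.

```python
# Hypothetical sketch of an archive-to-hot-edge ML workflow.
# Mount points, the manifest, and the recall step are illustrative assumptions.
import shutil
from pathlib import Path

ARCHIVE_TIER = Path("/mnt/archive_tier")  # petabyte-scale archive (assumed mount)
HOT_EDGE = Path("/mnt/hot_edge")          # small, fast primary tier for compute

def recall(relative_paths):
    """Restore a selected subset of archived files to the hot edge tier."""
    recalled = []
    for rel in relative_paths:
        src, dest = ARCHIVE_TIER / rel, HOT_EDGE / rel
        dest.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(src, dest)
        recalled.append(dest)
    return recalled

# Recall only the subset the current training run needs, not the whole archive.
manifest = ["exp42/images/batch_001.tar", "exp42/labels/batch_001.json"]
training_inputs = recall(manifest)

# Feed training_inputs to the ML framework of choice; when the run finishes,
# the hot edge copies can be deleted while the archive stays authoritative.
```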
3. As-a-Service delivery
Many of the digital enterprises and organizations with enormous amounts of unstructured file data don't have the IT resources or budget to match, let alone the capacity to keep pace with the IT requirements of that exponentially growing data.
To keep management overhead reasonable and cost-effective, many organizations are turning to as-a-service solutions. With as-a-service platforms, the software is remotely monitored, updated, and troubleshot, so organizations can focus on their business, not IT.
Modern archive solutions that are delivered as-a-service can help organizations save on total cost of ownership (TCO) once you account for the time they free up for IT administrators to focus on other tasks, such as planning long-term data management and archiving strategy.