Data silos are not going away. In fact, the pandemic has accelerated the pace of digitization, and with more departments undertaking more digital initiatives, we are creating more data silos. No company intends to create silos. They form when services begin to store and process data for their own use and different storage architectures are used throughout the data lifecycle, each with its own data management.

Silos have existed since the days of mainframes, when hot, active data was kept on expensive mainframe storage and cold, older data and backups went to tape archives. As we’ve added more price/performance tiers with disk, flash, object storage and cloud, and brought more vendors into the mix, data ends up in different places, sometimes for the same data streams and sometimes for different users, resulting in a proliferation of silos.

Silos were easier to manage when storage options were limited to disk and tape: disk was online and tape was offline, so it was acceptable to rely on a storage vendor’s proprietary solution to recover data from tape.

Silo management is more difficult today and becoming more urgent for many reasons:

  • Explosive data growth: About 90% of the world’s data was created in the past two years, and 80% of it is unstructured, meaning it doesn’t follow a predefined schema or data model. This growth is accelerating, which means organizations need more efficient ways to manage data.

    IDC says 175 ZBs will be created by 2025 (Image courtesy IDC)

  • The rise of edge and cloud: You might be thinking, “Isn’t everything moving to the cloud, and doesn’t that eliminate silos?” The answer is not so simple. First, while there is a massive generational shift to the cloud, the cloud is not a single monolithic silo. Rather, each cloud is a robust ecosystem of multiple data storage and processing architectures, both from the cloud provider and from third parties. Take AWS, for example: it currently has more than 20 file and object storage classes and tiers and a host of data analysis services, as well as third-party services such as NetApp, Snowflake, and Databricks. The challenge for enterprises is to gain visibility into the data in the various cloud compartments and datastores, and then mobilize the data so that the right data lives in the right class and tier at the right time. Meanwhile, we are only at the beginning of edge data. Self-driving cars, autonomous systems, and IoT devices will all generate more data that needs to be processed and consolidated at the edge and then moved to a cloud. This multiplies silos by location.
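As a rough illustration of keeping "the right data in the right class and tier at the right time," the sketch below picks a storage tier from the time since a file was last accessed. The tier names, the 30-day and 180-day thresholds, and the `FileInfo` and `choose_tier` names are all hypothetical, chosen only for this example; real policies would also weigh cost, access patterns and compliance.

```python
from dataclasses import dataclass
from datetime import date

# Hypothetical tiers and age thresholds (in days), for illustration only.
TIER_RULES = [
    (30, "hot"),    # accessed within the last 30 days
    (180, "cool"),  # accessed within the last 180 days
]
DEFAULT_TIER = "archive"  # anything older than the last threshold


@dataclass
class FileInfo:
    path: str
    last_accessed: date


def choose_tier(info: FileInfo, today: date) -> str:
    """Return the first tier whose age threshold covers the file."""
    age_days = (today - info.last_accessed).days
    for max_age_days, tier in TIER_RULES:
        if age_days <= max_age_days:
            return tier
    return DEFAULT_TIER
```

The same rule can be applied to file listings gathered from any silo, which is the point: the policy is defined once, independent of where the data currently lives.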

But isn’t a data lake a unifier of data silos?

Weren’t data lakes supposed to replace all silos? Data lakes are attractive because, unlike data warehouses, which enforce a strict schema, a data lake can ingest any data in its native form. This means the business can move data into a data lake without any pre-processing. However, this same capability means that data lakes easily turn into data swamps: unstructured data of various types, such as audio files, video files, genomics data, log data and documents, is dumped into the lake. It becomes impossible to find anything because there is no common structure.

The data lake risks becoming a dumping ground without proper governance

Moreover, there is not a single data lake. Even if you’re on a single cloud like Azure, you probably have multiple Azure accounts, each with multiple buckets, perhaps in different classes and tiers of Azure Blob and Azure Data Lake Storage. The conceptual “data lake” is actually fragmented across hundreds of buckets and cloud accounts, and it is incomplete because it doesn’t contain your file data stores or on-premises data.

It’s time to embrace data silos

Data silos aren’t going away. In fact, we’re going to have more of them. The answer is not to try to create a single new silo, but to look for solutions that can optimize, mobilize and manage data across silos so that users can search and extract data from multiple silos.

Imagine if you could easily keep only the datasets you need, no matter where the data resides; if you could consistently move the right data to the right place at the right time; and if you could optimize and enrich metadata as data streams through cognitive technologies and analysis systems.

This would allow data stewards and data scientists to move files to new clouds or applications while retaining the tags needed to quickly find and segment data to feed data analytics pipelines. Storage IT is evolving to include data management and enable business outcomes rather than just managing infrastructure or imposing limits on how and where data is stored. By embracing silos, businesses can gain flexibility, cost savings and performance and avoid vendor lock-in, while still being able to monetize their vast stores of data.
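To make the idea of tags that survive a move concrete, here is a minimal sketch of a metadata index that spans silos. The `Record` type, the `move` and `search` helpers, and the silo names in the usage example are hypothetical and do not describe any particular product; the sketch only assumes that tags are stored alongside a file's location rather than inside any one storage system.

```python
from dataclasses import dataclass, field, replace


@dataclass(frozen=True)
class Record:
    silo: str  # where the file currently lives (hypothetical silo name)
    path: str
    tags: frozenset = field(default_factory=frozenset)


def move(record: Record, new_silo: str) -> Record:
    """Relocate a record to another silo; path and tags are carried over."""
    return replace(record, silo=new_silo)


def search(index, tag):
    """Return every record carrying the tag, regardless of silo."""
    return [r for r in index if tag in r.tags]
```

For example, a file tagged "genomics" on a hypothetical `on_prem_nas` silo can be moved with `move(record, "cloud_object")`, and `search(index, "genomics")` still finds it afterward because the tag is part of the index entry, not of the storage location.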

About the Author: Krishna Subramanian is President, COO and Co-Founder of Komprise, a provider of unstructured data management solutions.
