In the information age, data is available to everyone who wants it. Businesses survive, thrive, win, and lose on their ability to make sense of this data. Today, evolving and emerging technologies such as AI, blockchain, and IoT dominate discussion in corporate boardrooms precisely because of their ability to generate data and insights that can transform the way business is done. The race is on to create agile, customer-centric organizations where innovation and sales are driven by real-time data.
Businesses that will eventually win this race are those that can break through organizational silos, upgrade legacy IT infrastructure, and create efficient operational processes by making their data and analytical assets accessible to all personnel. Needless to say, those that don’t will be left behind.
“Lots of companies are embracing data and analytics now because they’re scared to death of being disrupted by digital companies that use data and digital technologies to remake entire industries,” said Wayne Eckerson, Founder of BI consultancy Eckerson Group.
A survey by the Harvard Business Review found that there was a clear consensus among leaders on the capabilities that constitute a data-driven enterprise. Apart from securely accessing data, deploying and scaling analytics, and extracting valuable insights from existing data and apps, three inter-related priorities stood out:
- Collect and combine data from a variety of external sources
- Create a single version of the truth from this data
- Deliver customized, actionable intelligence to all employees
Central to achieving these goals is integrating structured and unstructured data from different sources and facilitating transparent storage, retrieval, and manipulation of that data.
Data virtualization is the means to this end.
What is Data Virtualization and How Does It Work?
Most companies, big and small, now access, collect, and store data in multiple formats: emails, social media posts, spreadsheets, logs, and so on. Add to that the databases their applications generate in different formats. Managing this data and facilitating sharing and access between multiple apps is difficult. The problem is compounded for enterprises, which must juggle multiple systems, models, and environments that involve heterogeneous data.
Data virtualization overcomes these limitations by allowing users to retrieve and manipulate data without knowing where it is stored or how it is formatted.
Data virtualization is the creation of a single, virtual, abstract layer that integrates data from multiple sources, across multiple applications, formats, and physical locations, and presents it to users without requiring data replication or movement. It connects to different databases, creates virtual views of disparate data, and publishes them as services, such as REST APIs.
This enables quicker and easier data access without performing an Extract, Transform, Load (ETL) process or using a separate platform or interface. Further, users, devices and applications can be given restricted access to specific chunks of data for analysis and reporting, in accordance with organizational policies.
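To make the idea concrete, here is a minimal, product-agnostic sketch in Python of what a virtual view looks like conceptually: sources are registered as connectors, and data is pulled only at query time rather than copied into a new store. The class and connector names are invented for illustration and are not the API of any particular data virtualization product.

```python
# Illustrative sketch only: a toy "virtual view" that federates two hypothetical
# sources (a CRM database and a spreadsheet export) at query time, without
# copying their data into a new store.

from typing import Callable, Dict, Iterable, List


class VirtualView:
    """Presents rows from several registered sources as one logical dataset."""

    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[], Iterable[dict]]] = {}

    def register_source(self, name: str, fetch: Callable[[], Iterable[dict]]) -> None:
        # Each source is just a callable; the view never stores its data.
        self._sources[name] = fetch

    def query(self, predicate: Callable[[dict], bool]) -> List[dict]:
        # Rows are pulled from the underlying systems only when a query runs.
        results = []
        for name, fetch in self._sources.items():
            for row in fetch():
                if predicate(row):
                    results.append({**row, "_source": name})
        return results


# Hypothetical connectors standing in for a CRM database and a CSV export.
crm_rows = lambda: [{"customer": "Acme", "region": "EMEA", "orders": 42}]
csv_rows = lambda: [{"customer": "Acme", "region": "EMEA", "tickets": 3}]

view = VirtualView()
view.register_source("crm_db", crm_rows)
view.register_source("support_csv", csv_rows)

print(view.query(lambda r: r["customer"] == "Acme"))
```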
Market-leading data virtualization solutions today include Denodo, SAS Federation Server, Red Hat JBoss Data Virtualization, and TIBCO Data Virtualization.
A Unified, Logical View of Data
Most enterprises have built, acquired or otherwise come to own dozens or even hundreds of information silos over the years, ranging from spreadsheets to operational databases. Each has its own structural framework, or schema, although some have no structure at all.
“Data silos are a serious business problem. We gather machine data in one place, security data in another, customer experience and support data in another – the list goes on. This structure may make sense at a departmental level, but it prevents collaboration necessary to ensure competitiveness. Companies need an operational data layer that is core to business processes and supports data sharing,” wrote Walter Scott in a Forbes Technology Council post.
Data virtualization helps organizations harmonize data scattered across departments, and even externally on apps, social media, and the web, without incurring the additional costs of building data warehouses or amassing raw data in sprawling data lakes.
“Data virtualization creates a logical view of multiple data sources without requiring the organization to replicate data and try to homogenize it into a single source,” explained Mike Wronski, former Director of Product Marketing at Nutanix.
Reduced Complexity in Accessing and Analyzing Data
When data virtualization was in its infancy, it involved using federated databases with strict data models and caching results for better performance. Engineers had to manually specify and enforce the underlying data structures and sources. That approach is outdated and impractical today because of the sheer amount of data generated across the ecosystem.
Eventually, it became possible to cost-effectively combine data from many different sources – including unstructured ones like email and Twitter conversations – into a single repository or data lake for analysis. However, doing so required organizations to copy large amounts of data into new repositories that were hard to structure, update, and govern. Over time, the cost-effectiveness of the process was lost to rising storage and processing expenses as well as complexity, prompting some observers to declare that “data lakes are turning into data swamps.”
This complexity became the principal driver of data virtualization. “Today, you don’t have just three or four databases. You might have 40, including relational, text, graph, search, and key-value stores, both on-premises and in the cloud,” said Matthew Baird, VP of C3.
“Today’s technology spiders across networks to discover data at the source, uses machine learning to interpret query results, and optimizes the schema accordingly,” he continued. “It’s an autonomous process that understands enough of the underlying infrastructure to do what data engineers would do. You tell us what you have, and we figure out the best way to use it.”
Data virtualization enables queries to span many data sources simultaneously, while appearing to the user to be a single, cohesive resource. The data itself never moves, which pays off in reduced complexity, fewer errors, and savings on servers, storage and bandwidth.
What’s more, the added abstraction layer doesn’t have to exact a performance penalty. Much as server and storage virtualization squeeze more utility out of bare-metal hardware, data virtualization architectures can improve response times by processing queries in parallel across the underlying sources and consolidating the results before presenting them.
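The fan-out-and-consolidate pattern described above can be sketched in a few lines of Python. A real virtualization engine would push filters down to each source (as a SQL WHERE clause, API filter, and so on) rather than filter client-side, and every name below is illustrative.

```python
# Minimal sketch: the same predicate is sent to each hypothetical source in
# parallel, and the partial results are merged before being returned.

from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, Iterable, List


def federated_query(
    sources: Dict[str, Callable[[], Iterable[dict]]],
    predicate: Callable[[dict], bool],
) -> List[dict]:
    def run_one(name_fetch):
        name, fetch = name_fetch
        # Here the filter runs client-side; a real engine pushes it to the source.
        return [dict(row, _source=name) for row in fetch() if predicate(row)]

    with ThreadPoolExecutor(max_workers=len(sources)) as pool:
        partials = pool.map(run_one, sources.items())

    merged: List[dict] = []
    for part in partials:
        merged.extend(part)
    return merged


# Example with two tiny in-memory "sources".
rows = federated_query(
    {"crm": lambda: [{"customer": "Acme", "orders": 42}],
     "erp": lambda: [{"customer": "Acme", "invoices": 7}]},
    lambda r: r["customer"] == "Acme",
)
print(rows)
```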
Key Benefits of Data Virtualization
“There’s room to virtualize at nearly every level of a company because no one ever has only one database,” declared Baird.
The bigger payoff, however, comes from giving everyone who needs access to data a single gateway, a single catalog, a single way to authenticate. “Gateways that understand enterprise-wide needs have enormous value,” he added. “You know which users in which locations are querying which data and driving outcomes.”
That has the attendant benefit of providing a unified view of how data is being used across the enterprise, information that organizations can use to allocate their storage and data resources more efficiently. Key benefits include:
- An infrastructure-agnostic environment enables integration of disparate data sources, lowers data redundancy, and saves operational costs.
- Multi-faceted data delivery is possible as organizations can publish datasets as data services or data views, as requested by the user or in the format of the application that is using the data.
- Multi-source and multi-mode data access makes it simple for users in different roles to work with data on demand, as needed, ensuring information agility and speeding up decision making in business settings.
- Hybrid query optimization lets users create and run customized queries for scheduled-push, demand-pull, and other advanced types of data requests.
- Strong security and data governance keep data safe from access by unauthorized users, devices, and locations. It is also possible to isolate or decouple a questionable or non-compliant data source from the pool (see the sketch after this list).
- Improved data quality and reduced data latency result from metadata storage and the reusable virtual data layers that modern data virtualization tools create, applying data transformation logic at the presentation layer. Since data isn’t replicated, there is no repeated extraction and far less risk of outdated or inconsistent data.
- A simplified architecture masks the complexity of underlying processes and streamlines application development and maintenance. It also allows for easier integration of new cloud sources with the existing IT infrastructure.
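As a rough illustration of the security and governance point above, the sketch below shows how a virtualization layer might apply role-based row and column policies before results reach a user. The roles, columns, and rules are invented for the example, not taken from any specific product.

```python
# Illustrative policy-based filtering: a support agent sees only EMEA rows and
# never the sensitive columns; a regional manager sees more. All rules invented.

from typing import Dict, List

POLICIES = {
    "support_agent":    {"hide_columns": {"ssn", "credit_limit"}, "regions": {"EMEA"}},
    "regional_manager": {"hide_columns": set(),                   "regions": {"EMEA", "APAC"}},
}


def apply_policy(rows: List[Dict], role: str) -> List[Dict]:
    policy = POLICIES[role]
    visible = []
    for row in rows:
        if row.get("region") not in policy["regions"]:
            continue  # row-level restriction
        visible.append({k: v for k, v in row.items() if k not in policy["hide_columns"]})
    return visible


rows = [
    {"customer": "Acme",   "region": "EMEA", "ssn": "xxx", "orders": 42},
    {"customer": "Globex", "region": "APAC", "ssn": "yyy", "orders": 7},
]
print(apply_policy(rows, "support_agent"))  # APAC row dropped, ssn column hidden
```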
Use Cases of Data Virtualization
The data virtualization market is flourishing – research from Facts and Factors shows that it is poised to surpass $8.5 billion by 2026. Growth has been fueled in part by disillusionment with Hadoop (the distributed data processing framework widely credited with igniting the early adoption of big data in the enterprise) and the rise of Kubernetes and cloud-native solutions.
Here are some common data virtualization use cases in the enterprise:
Improves logical data warehouse (LDW) functionality: Unlike a traditional data warehouse, an LDW doesn’t store data itself, so there’s no need to prepare or filter data in advance. Data virtualization simply federates queries across data stores, repositories, and applications, such as data warehouses, data lakes, web services, Hadoop, and NoSQL, and transparently presents data views to end users. It uses common protocols and APIs like REST, ODBC, and JDBC to facilitate quick data transfer. Admins can also assign workloads automatically to ensure SLA compliance.
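Assuming the virtualization layer publishes an ODBC endpoint, a client query against an LDW might look like the sketch below. It needs the pyodbc package and a configured data source name; the DSN, schema, and table names are hypothetical. The point is that the client opens one connection and writes one SQL statement, even though the engine resolves the two tables to different physical systems.

```python
# Hedged sketch of a federated query over ODBC; names are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=virtual_layer")  # hypothetical DSN published by the DV server
cursor = conn.cursor()

cursor.execute("""
    SELECT c.customer_id, c.segment, SUM(e.clicks) AS clicks
    FROM warehouse.customers AS c      -- resolved to the enterprise data warehouse
    JOIN lake.web_events     AS e      -- resolved to the data lake / NoSQL store
      ON e.customer_id = c.customer_id
    GROUP BY c.customer_id, c.segment
""")

for row in cursor.fetchall():
    print(row)
```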
Enhances big data analytics: Big data and predictive analytics depend on the real-time use of heterogeneous data sources such as email, social media, and IoT devices, which require real-time integration and interaction with multiple devices, platforms, and applications. Data virtualization enables logical views of derived data for advanced querying and analytics. Quick integration with BI tools and analytics apps ensures information agility.
Facilitates shared data access: One of the major causes of information silos in an organization is that different departments use dissimilar systems that don’t talk to each other. For example, banks set up separate call centers and customer service departments for home loans, credit cards, and retail banking. Normally, developers need to write extensive code to enable data sharing between these functions, which may also require complex data transformation operations and database mapping. Data virtualization ensures that everyone from customer service execs to regional managers can access a customer’s entire portfolio from multiple data stores and see it on a unified dashboard.
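A toy sketch of that single customer view: three stand-in line-of-business systems are merged into one logical record, and the dashboard code never touches the individual back ends. All functions and fields here are invented for illustration.

```python
# Illustrative "single customer view" across hypothetical banking systems.
from typing import Dict

# Stand-ins for the separate home-loan, credit-card, and retail-banking systems.
def fetch_home_loans(cid): return {"mortgage_balance": 250_000}
def fetch_cards(cid):      return {"card_limit": 10_000, "card_balance": 1_200}
def fetch_retail(cid):     return {"checking_balance": 4_300}


def customer_portfolio(customer_id: str) -> Dict:
    """Merge per-system records into one logical record for the dashboard."""
    portfolio = {"customer_id": customer_id}
    for fetch in (fetch_home_loans, fetch_cards, fetch_retail):
        portfolio.update(fetch(customer_id))
    return portfolio


print(customer_portfolio("C-1001"))
```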
Faster data integration: The primary purpose of data virtualization is to optimize how the enterprise gathers, stores, transfers, and transforms massive amounts of data for query and analysis. Traditional data integration methods such as ETL are good for bulk data movement, but they are time-intensive and their performance buckles under big data volumes. Data virtualization understands each data source and takes a surgical approach to homogenizing and moving data. Developers can use a global namespace with intelligent caching and in-memory metadata to logically integrate data at the application level rather than the storage level, so only the data that is actually needed for a query or transaction is moved and processed.
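The "move only what the query needs" idea can be sketched as a global namespace that maps logical names to source fetchers and caches records that have already been materialized. Everything below is an illustrative sketch rather than any specific product's API.

```python
# Minimal sketch: logical names resolve to source fetchers, and a small cache
# avoids re-pulling records already materialized for an earlier query.

from typing import Callable, Dict


class GlobalNamespace:
    def __init__(self) -> None:
        self._sources: Dict[str, Callable[[str], dict]] = {}
        self._cache: Dict[tuple, dict] = {}

    def mount(self, logical_name: str, fetch: Callable[[str], dict]) -> None:
        self._sources[logical_name] = fetch

    def get(self, logical_name: str, key: str) -> dict:
        cache_key = (logical_name, key)
        if cache_key not in self._cache:
            # Only the requested record crosses the network, not the whole table.
            self._cache[cache_key] = self._sources[logical_name](key)
        return self._cache[cache_key]


ns = GlobalNamespace()
ns.mount("crm.customers", lambda key: {"id": key, "segment": "enterprise"})  # stand-in source
print(ns.get("crm.customers", "C-1001"))  # fetched once, served from cache afterwards
```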
Data Drives Digital Transformation… Virtually
Almost every business and organization is shifting from a product-centric or service-centric model to a customer-centric one. The ubiquity of data today means that the consumer has access to almost as much information as the business itself, resulting in ever-increasing demands.
As organizations struggle to handle increasing volumes of data, technologies like virtualization will let them view data as a strategic asset instead of a functional resource. Without investing in mission-critical data and getting it into the hands of the right teams at the right time, businesses are doomed to stagnation and failure. It is here that CIOs and CTOs need to take the lead in enhancing organizational data readiness and empowering employees with data, analytics, and technology to drive strategic, long-term success.
This is an updated version of the original article that was published on March 10, 2020.
Dipti Parmar is a marketing consultant and contributing writer to Nutanix. She writes columns on major tech and business publications such as IDG’s CIO.com, Adobe’s CMO.com, Entrepreneur Mag, and Inc. Follow her on Twitter @dipTparmar and connect with her on LinkedIn.