DAL: Deriving Value from Unstructured, Semi-Structured and Structured Data
Information is at the heart of infrastructure, business operations, government and our day-to-day lives. Between 80 and 90 percent of the data in existence is unstructured, meaning it does not adhere to any predefined model, according to The Data Warehousing Institute (TDWI). At the same time, the amount of data available is increasing: TDWI estimated that by 2020 there would be 40 zettabytes of information in the world.
While such statistics make for engaging headlines, they have broader implications for developers. The information that consumers, professionals and organizations produce every day is only valuable if applications can use it to support business logic. Thus, the Data Access Layer (DAL) plays a crucial part in enabling enterprises to derive the most value from the unstructured, structured and semi-structured data they collect and create on a daily basis.
A quick overview of unstructured, structured and semi-structured data
Before discussing which DAL patterns developers can implement to handle different types of data, let's define exactly what unstructured, structured and semi-structured data are from an engineering perspective. Staff from Salzburg Research, an organization in Austria, detailed the difference between these three models:
● Unstructured data is simply any information that doesn’t have a pre-defined schema. This gives developers the freedom to store logs, documents, videos, files and other media without creating any schema. However, data in this form is opaque and is not easily queryable.
● Semi-structured data refers to any information stored in a self-describing format such as XML or JSON. These types of data have an open-ended schema that gives application data flexibility. Sometimes, this type of data is combined with structured data to record additional properties for specific types of records within a structured data store.
● Structured data is any information that adheres to an established schema. This is the kind of data held in the relational database systems with which most developers are familiar.
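To make the three categories concrete, here is a minimal Python sketch using the standard library's sqlite3 and json modules. The table and column names (asset, extra, payload) are hypothetical and chosen only for illustration: the fixed columns are the structured part, the JSON text column shows semi-structured properties stored alongside a structured record, and the blob stands in for unstructured content.

```python
import json
import sqlite3

# Hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE asset ("
    " id INTEGER PRIMARY KEY,"
    " name TEXT NOT NULL,"      # structured: fixed, typed column
    " extra TEXT NOT NULL,"     # semi-structured: self-describing JSON text
    " payload BLOB NOT NULL)"   # unstructured: opaque bytes, not queryable
)

conn.execute(
    "INSERT INTO asset (name, extra, payload) VALUES (?, ?, ?)",
    ("intro.mp4", json.dumps({"frame_rate": 24, "tags": ["demo"]}), b"\x00\x01\x02"),
)

# The structured column can be queried directly; the JSON must be parsed
# by the application, and the blob is returned as raw bytes.
name, extra, payload = conn.execute(
    "SELECT name, extra, payload FROM asset WHERE name = ?", ("intro.mp4",)
).fetchone()
properties = json.loads(extra)
```

Only the name column is directly queryable with plain SQL here, which is the trade-off the list above describes: flexibility increases as structure decreases, and queryability goes the other way.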
For the purpose of this article, we'll focus primarily on finding a DAL pattern capable of handling unstructured data.
Handling Unstructured Data
How does a developer create objects with unstructured data?
One option is to introduce a schema that accommodates the business logic. Suppose the unstructured data consists of videos. The developer could attach metadata to each video. In fact, it's possible such metadata already exists.
For example, Adobe Premiere Pro not only enables users to denote the date on which a video was created but also lists the frame rate, color saturation and other features that could be useful to application users. However, this metadata is only as valuable as it is useful to those interacting with the application. For instance, it's unlikely a day-to-day consumer would search for a video by its frame rate.
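As a rough illustration of this idea, the Python sketch below wraps opaque video bytes in a small metadata schema. The class and field names (VideoMetadata, Video, created_on, frame_rate) and the sample values are assumptions made for this example, not the format of any particular tool.

```python
from dataclasses import dataclass
from datetime import date
from typing import List


@dataclass(frozen=True)
class VideoMetadata:
    """Hypothetical schema describing an otherwise opaque video."""
    title: str
    created_on: date
    frame_rate: float


@dataclass(frozen=True)
class Video:
    metadata: VideoMetadata
    content: bytes  # the unstructured part: the application never inspects it


def find_created_on(videos: List[Video], day: date) -> List[Video]:
    """Query by metadata, since the raw content itself is not queryable."""
    return [v for v in videos if v.metadata.created_on == day]


# Sample data invented for the example.
library = [
    Video(VideoMetadata("launch", date(2015, 3, 1), 24.0), b"..."),
    Video(VideoMetadata("demo", date(2015, 3, 2), 30.0), b"..."),
]
matches = find_created_on(library, date(2015, 3, 2))
```

Which fields belong in the schema depends on what users actually search by. Here the creation date drives the query, while frame_rate is carried along but never used as a filter.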
Finding a pattern that accommodates complexity
While there are around half a dozen DAL architectures that can help developers make use of their unstructured data, the Static Repository method is a good way to get up and running quickly. Bo Griffin, one of our developers, coauthored a great piece about DAL patterns and dedicated a section to the Static Repository approach.
With a Static Repository, the caller doesn't have to create an instance or configure the repository representing the data before using it.
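A minimal sketch of that idea in Python might look like the following. The class name VideoRepository and its save and get methods are hypothetical, and an in-memory dictionary stands in for whatever real data store an application would use.

```python
from typing import Dict, Optional


class VideoRepository:
    """Static repository sketch: state and methods live on the class itself."""

    # Assumption: an in-memory dict replaces a real backing store.
    _store: Dict[str, bytes] = {}

    @classmethod
    def save(cls, key: str, content: bytes) -> None:
        """Store opaque content under a caller-chosen key."""
        cls._store[key] = content

    @classmethod
    def get(cls, key: str) -> Optional[bytes]:
        """Return the stored content, or None if the key is unknown."""
        return cls._store.get(key)


# Callers use the class directly; nothing is instantiated or configured first.
VideoRepository.save("intro.mp4", b"\x00\x01")
clip = VideoRepository.get("intro.mp4")
```

The convenience comes with a cost worth weighing: because the state is shared at class level, every caller sees the same store, which can make the repository harder to swap out or isolate in tests.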