While helping teams leverage time series anomaly detection, we often find there’s confusion surrounding the differences between structured and unstructured data. To better understand why, let’s review which data formats the industry is currently using, and some of the challenges they pose.
Simply put, structured data typically refers to highly organized, stored information that is efficiently and easily searchable. Unstructured data is not.
In practice, however, it’s not always so cut and dried. For example, if I save data in logs, does that mean the data is unstructured? Not necessarily. If each log line is made up of well-defined comma-separated values (CSV), where each field holds a well-structured value (e.g., temperature or weight), you can define a metadata schema for that log and process it in a structured way. The reverse applies to data in relational databases.
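As a minimal sketch of that idea, the snippet below applies a schema to comma-separated log lines. The field names, types, and sample lines are assumptions made up for illustration, not a real log format.

```python
import csv
import io

# Hypothetical schema for illustration: (field name, converter) per CSV position.
SCHEMA = [
    ("timestamp", str),
    ("sensor_id", str),
    ("temperature", float),
    ("weight", float),
]

def parse_log(text):
    """Parse comma-separated log lines into typed records using SCHEMA."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(SCHEMA):
            continue  # skip lines that do not have the expected number of fields
        records.append(
            {name: convert(value) for (name, convert), value in zip(SCHEMA, row)}
        )
    return records

# Made-up log content, one CSV record per line.
sample_log = (
    "2021-03-01T10:00:00,s1,21.5,70.2\n"
    "2021-03-01T10:01:00,s2,22.0,68.9\n"
)
records = parse_log(sample_log)
print(records[0]["temperature"])
```

Once the fields are typed, the log can be filtered and aggregated like any other structured source.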
And data that fits easily into a schema isn’t necessarily well structured. In some cases, data is saved into text fields where each field actually holds a big chunk of plain text. Without an unstructured manipulation approach, you won’t be able to get value from those free-text database fields.
Data can be represented in thousands of ways. Below is a table summarizing the most common structures, and the technologies and companies supporting these formats.
|Data Format|Typically Structured (S) or Unstructured (U)?|Technology Examples|Notes|
|---|---|---|---|
|Logs|U|Splunk, Elastic, SumoLogic|Those tools provide ways to only display the data in a structured way (tables, time series).|
|Events|U|Mixpanel|Some event platforms support both structured and unstructured formats.|
|Events|S|Google Analytics, Adobe Analytics| |
|Time Series|S|Anodot, DataDog, InfluxDB, OpenTSDB|Some of these tools provide ways to convert structured and unstructured data into time series.|
|Relational|S|SQL Databases, Excel|Definitely the most common way to save company data.|
|Columnar|S|Cassandra, BigQuery, Snowflake, Redshift| |
Unstructured Data: The Good and the Bad
Unstructured data isn’t necessarily data without a schema; rather, it is data that a programmer can’t analyze and work with unless there is a parsing heuristic. For example, if you need to analyze a customer chat log or free-text responses, it is unlikely your analysis will be 100 percent accurate. Some technologies, however, such as natural language processing (NLP) or parsing with regular expressions, help apply the appropriate context to the free text.
In some cases, such as machine data, the parsing is a collection of regular expressions that look for specific patterns (e.g., security events or failures). In more complex cases, it might be sentiment analysis that uses machine learning techniques to detect emotions.
The main challenge with these formats is that the technologies that support them require a lot of discovery effort before they deliver decent value. In cases where the logs come from machine data, it might be easier to prepare some parsing templates that represent the specific machine. But when the data is generated by internal applications, no templates can be predefined, and unstructured technologies will need to provide some sort of parsing schema. In most cases, there won’t be 100 percent parsing coverage; however, the more time you invest in defining those parse expressions, the more accurate your structure conversion will become.
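To make parsing templates and parsing coverage concrete, here is a small sketch. The template names, regular expressions, and log lines are all hypothetical; a real deployment would need patterns written for its own machines and applications.

```python
import re

# Hypothetical parsing templates; the patterns are made up for illustration.
TEMPLATES = {
    "login_failure": re.compile(r"Failed login for user (?P<user>\w+)"),
    "disk_error": re.compile(r"Disk (?P<disk>\w+) error code (?P<code>\d+)"),
}

def parse_line(line):
    """Return (template_name, fields) for the first matching template, else None."""
    for name, pattern in TEMPLATES.items():
        match = pattern.search(line)
        if match:
            return name, match.groupdict()
    return None

def coverage(lines):
    """Fraction of lines matched by at least one template."""
    if not lines:
        return 0.0
    matched = sum(parse_line(line) is not None for line in lines)
    return matched / len(lines)

# Made-up log lines; the third one has no template yet.
log_lines = [
    "2021-03-01 10:00:01 Failed login for user alice",
    "2021-03-01 10:00:05 Disk sda error code 17",
    "2021-03-01 10:00:09 Cache warmed in 35 ms",
    "2021-03-01 10:00:12 Failed login for user bob",
]
parsed = [parse_line(line) for line in log_lines]
cov = coverage(log_lines)
print(f"coverage: {cov:.0%}")
```

Every unmatched line is a candidate for a new template, which is where the ongoing discovery effort goes.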
Here’s another scenario. If my data log contains chats of user feedback and I want to gauge the number of complaints, I could just count the number of log entries that include words or phrases like “not happy” and “bad,” and I should get results with reasonable accuracy. But it will never be as accurate as having users mark a checkbox that says “not happy,” and having the results saved as a well-defined event. The advantage of the logs is that you don’t need to think too much when you dump the data. Still, sometimes just taking a text record, saving it with minimal validation, and preprocessing it is the only way to cover so many use cases with one format.
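The keyword-counting approach can be sketched in a few lines. The phrase list and chat messages below are invented for illustration, and the last message is there on purpose to show how this heuristic miscounts.

```python
# Phrases taken from the example in the text; the chat messages are made up.
COMPLAINT_PHRASES = ("not happy", "bad")

def is_complaint(message):
    """Flag a message if it contains any complaint phrase (case-insensitive)."""
    text = message.lower()
    return any(phrase in text for phrase in COMPLAINT_PHRASES)

chats = [
    "I am not happy with the delivery time",
    "Great service, thanks!",
    "The app is bad on my phone",
    "Not bad at all, works well",  # positive feedback that still contains "bad"
]
complaint_count = sum(is_complaint(chat) for chat in chats)
print(complaint_count)
```

The heuristic flags three messages although only two are real complaints, which is the accuracy gap a structured “not happy” checkbox would avoid.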
It’s not uncommon for programmers to save their application data in different ways and different formats, so those logs will contain values that need to be parsed. In other cases, when unstructured data is unavoidable and you’re working with third-party systems, each with a different format and a different way to represent requests and responses, it will be impossible to gather them into a single format that expresses them all. In those cases, more effort will be put into decoding than into encoding.
Log companies provide amazingly flexible ways to manipulate non-structured data. But their solutions can become very expensive when they try to force you to save in a structured format. Saving a number as text is generally more expensive than saving it as an integer, and if you’re sensitive to performance, parsing is always more expensive than native manipulations.
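A quick way to see the storage side of this is to compare the bytes needed for one number in both forms. This is only a sketch: the value is arbitrary, and a 64-bit little-endian integer is an assumed encoding, not what any particular log product uses.

```python
import struct

value = 1234567890123  # arbitrary 13-digit number chosen for illustration

# Text form: one byte per decimal digit in UTF-8.
as_text = str(value).encode("utf-8")

# Binary form: fixed-width signed 64-bit integer (assumed encoding).
as_int64 = struct.pack("<q", value)

print(len(as_text), len(as_int64))

# Reading the value back: the text form has to be parsed digit by digit,
# while the binary form is unpacked directly.
from_text = int(as_text)
from_int64 = struct.unpack("<q", as_int64)[0]
```

For very small numbers a fixed-width integer can take more bytes than the text, but the text form still has to be parsed before any arithmetic can be done on it.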
About Structured Data
As human beings, we are always trying to be more organized in our thoughts. We assign names to people, addresses to houses, and to our communications, a string of words. In a similar fashion, we label and arrange our data to give us more powerful ways and tools to process and analyze information.
When data is organized into a well-defined structure, a schema can be defined. This leads to a better data validation process, better data quality, and better insight. For example, if you have a columnar database in which each column represents a well-defined measure, such as revenue or transactions, then it will be easier to build a proxy layer that validates the data that comes in. And, of course, it will be much easier to manipulate the data and to run flexible operations such as sums, range checks, range indexing, and aggregations.
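A validation layer of this kind might look like the sketch below. The column definitions, minimum values, and sample rows are assumptions for illustration, not the schema of any real database.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Column:
    name: str
    dtype: type
    minimum: float  # smallest value accepted by the range check

# Hypothetical schema with the two measures mentioned in the text.
SCHEMA = (
    Column("revenue", float, 0.0),
    Column("transactions", int, 0),
)

def validate(row):
    """Return a list of problems found in row; an empty list means it conforms."""
    errors = []
    for col in SCHEMA:
        if col.name not in row:
            errors.append(f"missing {col.name}")
            continue
        value = row[col.name]
        if not isinstance(value, col.dtype):
            errors.append(f"{col.name} should be {col.dtype.__name__}")
        elif value < col.minimum:
            errors.append(f"{col.name} below {col.minimum}")
    return errors

# Made-up incoming rows: one valid, one out of range, one with the wrong type.
rows = [
    {"revenue": 120.5, "transactions": 3},
    {"revenue": -5.0, "transactions": 1},
    {"revenue": "n/a", "transactions": 2},
]
valid_rows = [row for row in rows if not validate(row)]
total_revenue = sum(row["revenue"] for row in valid_rows)
print(total_revenue)
```

Because the bad rows are rejected at the door, the aggregation at the end needs no defensive parsing.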
With unstructured data, you will always need to worry that a small change in the data will upset all your parsing assumptions; with structured data, this is not the case. On the other hand, structured data requires more thought, design and preparatory work, and cannot match all the use cases.
Unstructured to Structured Conversion
In many cases, in order to achieve a stronger analysis of unstructured data, you must first convert it to a structured format. Take logs, for example. They can be converted into a time series if they contain a timestamp, or into tables if you can define a regex for each field (as with CSV). There are even cases when the structured data itself needs a better representation in order to be analyzed well.
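The log-to-time-series conversion can be sketched as follows. The line layout (timestamp, level, message), the regular expression, and the sample lines are assumptions for illustration; real logs would need their own pattern.

```python
import re
from collections import Counter
from datetime import datetime

# Assumed line layout: "YYYY-MM-DD HH:MM:SS LEVEL message".
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}) (?P<level>[A-Z]+) (?P<msg>.*)$"
)

def errors_per_minute(lines):
    """Count ERROR lines per minute and return sorted (minute, count) pairs."""
    counts = Counter()
    for line in lines:
        match = LINE_RE.match(line)
        if not match or match.group("level") != "ERROR":
            continue  # unparseable lines and non-errors are dropped
        ts = datetime.strptime(match.group("ts"), "%Y-%m-%d %H:%M:%S")
        counts[ts.replace(second=0)] += 1  # truncate to the minute bucket
    return sorted(counts.items())

# Made-up log lines, including one that cannot be parsed.
log_lines = [
    "2021-03-01 10:00:05 ERROR timeout calling payment service",
    "2021-03-01 10:00:40 INFO request served",
    "2021-03-01 10:00:59 ERROR timeout calling payment service",
    "2021-03-01 10:01:10 ERROR disk full",
    "garbage line without a timestamp",
]
series = errors_per_minute(log_lines)
for minute, count in series:
    print(minute.isoformat(), count)
```

The result is a compact series of counts per minute that a time series tool can consume, at the cost of discarding the message text.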
In the case of columnar data, for instance, you might want to convert it into a time series to understand trends in the columns that represent measurements. However, each transformation might lead to some data loss and compromise the data coverage. Companies usually save a portion of the same data in other formats but will still preserve the basic, original data format as a backup.
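The following sketch shows one such transformation and the information it gives up. The table contents and column names are invented, and a plain dictionary of lists stands in for a real columnar store.

```python
from collections import defaultdict

# Hypothetical columnar table: one list per column, rows aligned by index.
table = {
    "date": ["2021-03-01", "2021-03-01", "2021-03-02", "2021-03-02", "2021-03-02"],
    "country": ["US", "DE", "US", "US", "FR"],
    "revenue": [100.0, 40.0, 75.0, 25.0, 60.0],
}

def daily_series(table, time_col, measure_col):
    """Sum a measure column per value of the time column, sorted by time."""
    totals = defaultdict(float)
    for day, value in zip(table[time_col], table[measure_col]):
        totals[day] += value
    return sorted(totals.items())

revenue_series = daily_series(table, "date", "revenue")
print(revenue_series)
```

The daily totals make the trend easy to see, but the per-row detail and the country column are gone, which is why the original table is worth keeping alongside the derived series.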
If you can, structure it! This is the most powerful and cheapest way to work with data. In cases when you can’t cover all your data in a structured way, you might want to have different views of the same data. It might seem inefficient to save both unstructured and structured data, but in the long run it may turn out to be even more cost effective.