Essential data types: What you need to know

Mike Perrow Technology Evangelist, Vertica

Big data is a big market, already valued worldwide at nearly $200 billion and expected to grow to $274 billion by 2022. What's driving this is the growing realization by many medium-size to large businesses that a wealth of insight lies hidden in data that was, until recently, too voluminous to access and make use of.

Today, specially built applications and nontraditional data storage and analysis methods have made huge datasets—based on transaction logs, email, sensor readings, even photographic and video sources—accessible to organizations looking to find meaning in a sea of information that was, for the most part, previously ignored.

There's a lot of data out there, and it's big and unwieldy. But a set of best practices has started to emerge when it comes to making the most of corporate data. Here are the challenges, and what data scientists mean when they talk about datasets.

Data types and the drive to big data analytics

Relational databases started to gain traction in the 1980s, when businesses began to exploit new data technologies for managing customer records and hundreds of related requirements. Relational databases standardized many elements in customer records, which only a few years before had been paper-based and stored in physical file folders, in whatever format a business felt like using. There was no widespread agreement about the types of data collected or the formats used.

As businesses and their partners, auditors, and consumers began requiring access to computerized data records, structured data types made it far easier for these records to be shared and understood by systems operated by different organizations. It's like soccer (or "fútbol," among its other names outside the US): the game can be played anywhere in the world, and no matter where the players are from, they follow rules that everyone, everywhere, understands.

The advent of structured data made data storage easier, and the data itself became more valuable. But these were hardly the good old days of data management. Structured data represented only a part of the information being generated by increasingly digitized communication modes, including music, photos, videos, text, geospatial data, metadata, and so on.

None of these nonstandard data types could be managed by a relational database in their native form. Those are examples of unstructured data—really big records, or files, with undisciplined boundaries.

Meanwhile, as Moore's Law predicted back in 1965, the essential elements of computing kept getting smaller and cheaper. Data management gradually embraced new capabilities for handling all these new data types, and new hardware and software emerged to support them.

Today, most data falls into one of three basic types.

Structured data

This type of data can be neatly organized in rows and columns, reflecting, for example, name, address, phone number, and so on. Sources of structured data include database files, log files generated by IT operations, security systems activity, Internet of Things data, and much more. In most big data applications, data must be transformed into a structured format before it can be analyzed.

Note that transactional data, the sort of records that result from a sale, are usually structured. Item, price, margin, customer name, etc. are all data elements that can be stored in a relational database for use in reporting, trending, planning, and so on. That’s something businesses have been doing for decades. Machine data derived from sensors to record heat, wear and tear, pressure, etc. is also structured data. That includes data from IoT devices.
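As a minimal sketch (the field names here are hypothetical, not from any particular schema), a transactional record like the one described above maps naturally onto rows and columns:

```python
# A hypothetical sales record: every field has a fixed, well-defined
# type, so it fits naturally into a relational table.
from dataclasses import dataclass
from datetime import date

@dataclass
class Sale:
    item: str
    price: float
    margin: float
    customer: str
    sold_on: date

sales = [
    Sale("widget", 9.99, 0.40, "Acme Corp", date(2021, 3, 1)),
    Sale("gadget", 24.50, 0.55, "Acme Corp", date(2021, 3, 2)),
]

# Because every record shares the same structure, aggregation for
# reporting, trending, and planning is straightforward:
total_revenue = sum(s.price for s in sales)
```

This uniformity is exactly what makes structured data so amenable to decades-old relational techniques.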

Semi-structured data

This type of data carries metadata that identifies various aspects of a given file. The metadata can include structured elements, such as name, address, and date, while some of the file's content remains unstructured. Email is one example. Other examples include script language files that may be partly human-readable, with break points such as tags or other semantic elements used by computer applications.
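The email example can be made concrete with a short sketch using Python's standard-library email parser (the addresses and subject line here are invented for illustration):

```python
# An email mixes structured metadata (its headers) with unstructured
# content (its body) -- a classic semi-structured record.
from email import message_from_string

raw = """\
From: support@example.com
To: customer@example.com
Subject: Ticket 1234
Date: Mon, 01 Mar 2021 09:00:00 +0000

Thanks for reaching out. We'll look into the issue shortly.
"""

msg = message_from_string(raw)

# The headers parse into predictable name/value fields...
sender = msg["From"]      # structured: "support@example.com"
subject = msg["Subject"]  # structured: "Ticket 1234"

# ...while the body is free-form text with no fixed schema.
body = msg.get_payload()
```

The headers can be indexed and queried like any structured field; the body needs text analytics before it yields anything comparable.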

Unstructured data

Have you ever tried emailing a friend a copy of a video or a music file and had your email system fail in the process? Data types that are temporal in nature (i.e., require one or more seconds to consume, such as a song) or that require some kind of spatial rendering (such as a photo or video) are examples of unstructured data. The files are huge, often orders of magnitude larger than files based on structured data.

Unstructured data presents several practical problems in terms of usefulness, at least historically. Consider the sheer size of the data being stored. How do you sort through potentially hundreds of terabytes of information to find anything useful? There's way more noise than signal.

Also, consider that unstructured data records can't be easily parsed according to, say, name, address, phone number, state, etc. Each record is an opaque blob of encoded content, along with whatever codecs and libraries are required to render it in its original form. Yet it can be incredibly useful.

How these data types are related

Consider some of the new developments over the past 20 years that have put big data analytics on the map:


  • The availability and affordability of massive data storage, which makes it possible to manage very large datasets in a single, logically defined location. Massively parallel processing (MPP) can be based on these very large arrays of commodity hardware, and as you need more compute or storage space, you add more hardware for relatively little money. The cloud has introduced even more options for virtually limitless data storage.
  • This ability to store vast quantities of data has put unstructured data into play for analytics. Not long ago, unstructured data was thought unsuited for big data analysis. How do you make sense of photos, music, unbroken streams of voice data, and other such data types? Today, unstructured data is part of the mix; more on this later.
  • The IoT has matured to the point where machine and sensor data makes up a huge percentage of all data that can be stored and analyzed. Plus, that data has the advantage of arriving in a structured format, ready for analysis.

Put these developments together, and you get scenarios that define today's most common use cases for big data analytics.

A real-world example

A customer support center handles hundreds of calls each week and records each one; the recordings go straight to the company's data lake, a sea of information captured for some usually undefined later purpose. While individual call records are retrievable and listenable, no one has the time or, frankly, the ability to listen to hundreds of recordings to discover any themes within customer issues.

Yet learning those patterns can be vital to improving customer success and repeat business. So instead of simply dumping call records straight to the data lake, imagine that each call is transcribed into text, with metadata such as time stamp, length of call, rating of call effectiveness, class of problem, and a host of other ways to describe that interaction between the support center and the customer.

The formerly unstructured data is now structured and can be moved to a database for querying and analysis.
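A minimal sketch of what that looks like (all field names and values here are hypothetical): once each call is transcribed and tagged with metadata, it becomes a record that can be aggregated like any other structured data.

```python
# Hypothetical call records after transcription and tagging: the
# transcript text is still free-form, but the metadata is structured.
from collections import Counter

calls = [
    {"timestamp": "2021-03-01T09:15", "length_sec": 340,
     "effectiveness": 4, "problem_class": "configuration",
     "transcript": "Customer could not enable a feature in product X."},
    {"timestamp": "2021-03-01T10:02", "length_sec": 610,
     "effectiveness": 2, "problem_class": "configuration",
     "transcript": "Feature fails when settings Y and Z are combined."},
    {"timestamp": "2021-03-01T11:47", "length_sec": 180,
     "effectiveness": 5, "problem_class": "billing",
     "transcript": "Invoice question, resolved quickly."},
]

# With structure in place, finding themes is a simple aggregation
# rather than hours of listening to recordings:
themes = Counter(c["problem_class"] for c in calls)
top_theme, count = themes.most_common(1)[0]
```

In practice this aggregation would be a SQL query against the database the records were loaded into; the principle is the same.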

When data becomes more understandable, organizations can find patterns. In this case, say, the analytics team may find that customers tend to have problems using a feature in product X whenever some user configuration is set to Y and Z.

That set of dependencies and its impact on an otherwise happy customer might not have been discovered without the ability to correlate thousands of call records made by hundreds of call center personnel over a matter of months.

Enter the age of analytics

This was a fairly simple example, but it points to the reasons why massive data storage has become important to business operations, and how big data techniques can pay off for organizations that are routinely storing data from many sources in a variety of formats.

Doing something useful with your data is not about retrieving individual records and reviewing them one by one. Rather, effective use of big data is all about figuring out how the elements of one record correlate to those of another. Analytics is the name of the game.
