Welcome!

AJAX & REA Authors: Liz McMillan, ChandraShekar Dattatreya, Elizabeth White, David H Deans, Trevor Parsons

Blog Feed Post

The Structure of Big Data

First things first, all data is more or less structured. That being said, there is…

  • Structured Data
  • Semi-Structured Data
  • Unstructured Data

I tend to think of it as: data, composite or simple, with or without content. In that context, email is structured composite data (from, to, subject, date) with unstructured content (message body). The composite data is structured. The content is unstructured. Though simple data may or may not be structured. The ‘subject’ data is unstructured. The ‘to’ data is structured. It is composed of a local-part (username) and a domain.

While content is unstructured, there may be an implied structure.

So what is the difference between structured data, semi-structured data, and unstructured data?

Structured Data

The structure is externally enforced. Data

  • The data is stored in a database.
    • Relational
      • Transaction
    • XML
      • Catalog

The data itself is not structured. The structure is defined by the database. The data would be semi-structured if it was exported and transformed into JSON or XML.

Semi-Structured Data

The structure is self defined. Data and / or Content

  • The data is stored as text.
    • JSON
      • User Profile
    • XML
      • Application / Form

Unstructured Data

The structure of the data is externally defined. Metadata and Content

  • The data is stored in a binary format and / or document.
    • Media (e.g. Ogg).
    • Video (e.g. Vorbis).
    • Audio (e.g. Theora).
    • Image (e.g. PNG).
    • Microsoft Word
    • Adobe PDF

The structure is defined by the file format. The data is composed of structured metadata and unstructured content. That being said, a video is composed of frames and an image is composed of pixels.

  • The data is stored as plain text.
    • Log

The structure is defined by a pattern in the logging configuration file. The data is composed of structured metadata (e.g. severity) and unstructured content (log message).

  • The data is user generated.
    • Status (Facebook)
    • Tweet (Twitter)
    • Comment (WordPress)

The structure is defined by the application / form. The data is composed of structured metadata (e.g. user ID) and unstructured content (user message).

Update

I would say that the structure of content is user defined and thus interpreted. However, content is often a component of data (unstructured, externally defined). Though if I typed up this post in gedit and saved it as a text file, that might constitute content independent of data.


With all all that said, the structure of data is not exactly black and white.


Read the original blog entry...

More Stories By Daniel Thompson

I curate the content on this page, but the credit goes to my talented colleagues for the posts that you see here. Much of what you read on this page is the work of friends at How to JBoss, and I encourage you to drop by the site at http://www.howtojboss.com for some of the best JBoss technical and non-technical content for developers, architects and technology executives on the Web.

@CloudExpo Stories
Connected devices and the Internet of Things are getting significant momentum in 2014. In his session at Internet of @ThingsExpo, Jim Hunter, Chief Scientist & Technology Evangelist at Greenwave Systems, examined three key elements that together will drive mass adoption of the IoT before the end of 2015. The first element is the recent advent of robust open source protocols (like AllJoyn and WebRTC) that facilitate M2M communication. The second is broad availability of flexible, cost-effective ...
Scott Jenson leads a project called The Physical Web within the Chrome team at Google. Project members are working to take the scalability and openness of the web and use it to talk to the exponentially exploding range of smart devices. Nearly every company today working on the IoT comes up with the same basic solution: use my server and you'll be fine. But if we really believe there will be trillions of these devices, that just can't scale. We need a system that is open a scalable and by using ...
"SAP had made a big transition into the cloud as we believe it has significant value for our customers, drives innovation and is easy to consume. When you look at the SAP portfolio, SAP HANA is the underlying platform and it powers all of our platforms and all of our analytics," explained Thorsten Leiduck, VP ISVs & Digital Commerce at SAP, in this SYS-CON.tv interview at 15th Cloud Expo, held Nov 4-6, 2014, at the Santa Clara Convention Center in Santa Clara, CA.
SAP is delivering break-through innovation combined with fantastic user experience powered by the market-leading in-memory technology, SAP HANA. In his General Session at 15th Cloud Expo, Thorsten Leiduck, VP ISVs & Digital Commerce, SAP, discussed how SAP and partners provide cloud and hybrid cloud solutions as well as real-time Big Data offerings that help companies of all sizes and industries run better. SAP launched an application challenge to award the most innovative SAP HANA and SAP HANA...
How do APIs and IoT relate? The answer is not as simple as merely adding an API on top of a dumb device, but rather about understanding the architectural patterns for implementing an IoT fabric. There are typically two or three trends: Exposing the device to a management framework Exposing that management framework to a business centric logic Exposing that business layer and data to end users. This last trend is the IoT stack, which involves a new shift in the separation of what stuff happe...
DevOps is all about agility. However, you don't want to be on a high-speed bus to nowhere. The right DevOps approach controls velocity with a tight feedback loop that not only consists of operational data but also incorporates business context. With a business context in the decision making, the right business priorities are incorporated, which results in a higher value creation. In his session at DevOps Summit, Todd Rader, Solutions Architect at AppDynamics, discussed key monitoring techniques...
SYS-CON Events announced today that Gridstore™, the leader in hyper-converged infrastructure purpose-built to optimize Microsoft workloads, will exhibit at SYS-CON's 16th International Cloud Expo®, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Gridstore™ is the leader in hyper-converged infrastructure purpose-built for Microsoft workloads and designed to accelerate applications in virtualized environments. Gridstore’s hyper-converged infrastructure is the ...
The 3rd International @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that it is now accepting Keynote Proposals. The Internet of Things (IoT) is the most profound change in personal and enterprise IT since the creation of the Worldwide Web more than 20 years ago. All major researchers estimate there will be tens of billions devices - computers, smartphones, tablets, and sensors - connected to th...
Fundamentally, SDN is still mostly about network plumbing. While plumbing may be useful to tinker with, what you can do with your plumbing is far more intriguing. A rigid interpretation of SDN confines it to Layers 2 and 3, and that's reasonable. But SDN opens opportunities for novel constructions in Layers 4 to 7 that solve real operational problems in data centers. "Data center," in fact, might become anachronistic - data is everywhere, constantly on the move, seemingly always overflowing. Net...
An entirely new security model is needed for the Internet of Things, or is it? Can we save some old and tested controls for this new and different environment? In his session at @ThingsExpo, New York's at the Javits Center, Davi Ottenheimer, EMC Senior Director of Trust, reviewed hands-on lessons with IoT devices and reveal a new risk balance you might not expect. Davi Ottenheimer, EMC Senior Director of Trust, has more than nineteen years' experience managing global security operations and asse...
The 4th International DevOps Summit, co-located with16th International Cloud Expo – being held June 9-11, 2015, at the Javits Center in New York City, NY – announces that its Call for Papers is now open. Born out of proven success in agile development, cloud computing, and process automation, DevOps is a macro trend you cannot afford to miss. From showcase success stories from early adopters and web-scale businesses, DevOps is expanding to organizations of all sizes, including the world's large...
SYS-CON Media announced that Centrify, a provider of unified identity management across cloud, mobile and data center environments that delivers single sign-on (SSO) for users and a simplified identity infrastructure for IT, has launched an ad campaign on Cloud Computing Journal. The ads focus on security: how an organization can successfully control privilege for all of the organization’s identities to mitigate identity-related risk without slowing down the business, and how Centrify provides ...
The Internet of Things promises to transform businesses (and lives), but navigating the business and technical path to success can be difficult to understand. In his session at @ThingsExpo, Sean Lorenz, Technical Product Manager for Xively at LogMeIn, demonstrated how to approach creating broadly successful connected customer solutions using real world business transformation studies including New England BioLabs and more.
There's Big Data, then there's really Big Data from the Internet of Things. IoT is evolving to include many data possibilities like new types of event, log and network data. The volumes are enormous, generating tens of billions of logs per day, which raise data challenges. Early IoT deployments are relying heavily on both the cloud and managed service providers to navigate these challenges. In her session at Big Data Expo®, Hannah Smalltree, Director at Treasure Data, discussed how IoT, Big D...
The Internet of Things will greatly expand the opportunities for data collection and new business models driven off of that data. In her session at @ThingsExpo, Esmeralda Swartz, CMO of MetraTech, discussed how for this to be effective you not only need to have infrastructure and operational models capable of utilizing this new phenomenon, but increasingly service providers will need to convince a skeptical public to participate. Get ready to show them the money!
The 3rd International Internet of @ThingsExpo, co-located with the 16th International Cloud Expo - to be held June 9-11, 2015, at the Javits Center in New York City, NY - announces that its Call for Papers is now open. The Internet of Things (IoT) is the biggest idea since the creation of the Worldwide Web more than 20 years ago.
In her General Session at 15th Cloud Expo, Anne Plese, Senior Consultant, Cloud Product Marketing, at Verizon Enterprise, focused on finding the right mix of renting vs. buying Oracle capacity to scale to meet business demands, and offer validated Oracle database TCO models for Oracle development and testing environments. Anne Plese is a marketing and technology enthusiast/realist with over 19+ years in high tech. At Verizon Enterprise, she focuses on driving growth for the Verizon Cloud platfo...
SYS-CON Events announced today Isomorphic Software, the global leader in high-end, web-based business applications, will exhibit at SYS-CON's DevOps Summit 2015 New York, which will take place on June 9-11, 2015, at the Javits Center in New York City, NY. Isomorphic Software is the global leader in high-end, web-based business applications. We develop, market, and support the SmartClient & Smart GWT HTML5/Ajax platform, combining the productivity and performance of traditional desktop software ...
The Internet of Things will put IT to its ultimate test by creating infinite new opportunities to digitize products and services, generate and analyze new data to improve customer satisfaction, and discover new ways to gain a competitive advantage across nearly every industry. In order to help corporate business units to capitalize on the rapidly evolving IoT opportunities, IT must stand up to a new set of challenges. In his session at @ThingsExpo, Jeff Kaplan, Managing Director of THINKstrateg...
The Internet of Things is tied together with a thin strand that is known as time. Coincidentally, at the core of nearly all data analytics is a timestamp. When working with time series data there are a few core principles that everyone should consider, especially across datasets where time is the common boundary. In his session at Internet of @ThingsExpo, Jim Scott, Director of Enterprise Strategy & Architecture at MapR Technologies, discussed single-value, geo-spatial, and log time series dat...