Welcome!

Machine Learning Authors: Carmen Gonzalez, Pat Romanski, Liz McMillan, Shelly Palmer, Ed Featherston

Related Topics: Artificial Intelligence, @CloudExpo, @BigDataExpo, @ThingsExpo

Artificial Intelligence: Blog Post

Data Unification at Scale | @CloudExpo #BigData #DataLake #AI #Analytics

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies

This term Data Unification is new in the Big Data lexicon, pushed by varieties of companies such as Talend, 1010Data, and TamR. Data unification deals with the domain known as ETL (Extraction, Transformation, Loading), initiated during the 1990s when Data Warehousing was gaining relevance. ETL refers to the process of extracting data from inside or outside sources (multiple applications typically developed and supported by different vendors or hosted on separate hardware), transform it to fit operational needs (based on business rules), and load it into end target databases, more specifically, an operational data store, data mart, or a data warehouse. These are read-only databases for analytics. Initially the analytics was mostly retroactive (e.g. how many shoppers between age 25-35 bought this item between May and July?). This was like driving a car looking at the rear-view mirror. Then forward-looking analysis (called data mining) started to appear. Now business also demands "predictive analytics" and "streaming analytics".

During my IBM and Oracle days, the ETL in the first phase was left for outside companies to address. This was unglamorous work and key vendors were not that interested to solve this. This gave rise to many new players such as Informatica, Datastage, Talend and it became quite a thriving business. We also see many open-source ETL companies.

The ETL methodology consisted of: constructing a global schema in advance, for each local data source write a program to understand the source and map to the global schema, then write a script to transform, clean (homonym and synonym issues) and dedup (get rid of duplicates) it. Programs were set up to build the ETL pipeline. This process has matured over 20 years and is used today for data unification problems. The term MDM (Master Data Management) points to a master representation of all enterprise objects, to which everybody agrees to confirm.

In the world of Big Data, this approach is very inadequate. Why?

  • Data unification at scale is a very big deal. The schema-first approach works fine with retail data (sales transactions, not many data sources,..), but gets extremely hard with sources that can be hundreds or even thousands. This gets worse when you want to unify public data from the web with enterprise data.
  • Human labor to map each source to a master schema gets to be costly and excessive. Here machine learning is required and domain experts should be asked to augment where needed.
  • Real-time data unification of streaming data and analysis can not be handled by these solutions.

Another solution called "data lake" where you store disparate data in their native format, seems to address the "ingest" problem only. It tries to change the order of ETL to ELT (first load then transform). However it does not address the scale issues. The new world needs bottoms-up data unification (schema-last) in real-time or near real-time.

The typical data unification cycle can go like this - start with a few sources, try enriching the data with say X, see if it works, if you fail then loop back and try again. Use enrichment to improve and do everything automatically using machine learning and statistics. But iterate furiously. Ask for help when needed from domain experts. Otherwise the current approach of ETL or ELT can get very expensive.

  • LikeData Unification at scale
  • Comment
  • ShareShare Data Unification at scale



Read the original blog entry...

More Stories By Jnan Dash

Jnan Dash is Senior Advisor at EZShield Inc., Advisor at ScaleDB and Board Member at Compassites Software Solutions. He has lived in Silicon Valley since 1979. Formerly he was the Chief Strategy Officer (Consulting) at Curl Inc., before which he spent ten years at Oracle Corporation and was the Group Vice President, Systems Architecture and Technology till 2002. He was responsible for setting Oracle's core database and application server product directions and interacted with customers worldwide in translating future needs to product plans. Before that he spent 16 years at IBM. He blogs at http://jnandash.ulitzer.com.

@CloudExpo Stories
While some vendors scramble to create and sell you a fancy solution for monitoring your spanking new Amazon Lambdas, hear how you can do it on the cheap using just built-in Java APIs yourself. By exploiting a little-known fact that Lambdas aren’t exactly single threaded, you can effectively identify hot spots in your serverless code. In his session at 20th Cloud Expo, David Martin, Principal Product Owner at CA Technologies, will give a live demonstration and code walkthrough, showing how to o...
In his keynote at 19th Cloud Expo, Sheng Liang, co-founder and CEO of Rancher Labs, discussed the technological advances and new business opportunities created by the rapid adoption of containers. With the success of Amazon Web Services (AWS) and various open source technologies used to build private clouds, cloud computing has become an essential component of IT strategy. However, users continue to face challenges in implementing clouds, as older technologies evolve and newer ones like Docker c...
SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help globa...
Developers want to create better apps faster. Static clouds are giving way to scalable systems, with dynamic resource allocation and application monitoring. You won't hear that chant from users on any picket line, but helping developers to create better apps faster is the mission of Lee Atchison, principal cloud architect and advocate at New Relic Inc., based in San Francisco. His singular job is to understand and drive the industry in the areas of cloud architecture, microservices, scalability ...
SYS-CON Events announced today that Juniper Networks (NYSE: JNPR), an industry leader in automated, scalable and secure networks, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Juniper Networks challenges the status quo with products, solutions and services that transform the economics of networking. The company co-innovates with customers and partners to deliver automated, scalable and secure network...
SYS-CON Events announced today that T-Mobile will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. As America's Un-carrier, T-Mobile US, Inc., is redefining the way consumers and businesses buy wireless services through leading product and service innovation. The Company's advanced nationwide 4G LTE network delivers outstanding wireless experiences to 67.4 million customers who are unwilling to compromise on ...
@DevOpsSummit has been named the ‘Top DevOps Influencer' by iTrend. iTred processes millions of conversations, tweets, interactions, news articles, press releases, blog posts - and extract meaning form them and analyzes mobile and desktop software platforms used to communicate, various metadata (such as geo location), and automation tools. In overall placement, @DevOpsSummit ranked as the number one ‘DevOps Influencer' followed by @CloudExpo at third, and @MicroservicesE at 24th.
With major technology companies and startups seriously embracing IoT strategies, now is the perfect time to attend @ThingsExpo 2016 in New York. Learn what is going on, contribute to the discussions, and ensure that your enterprise is as "IoT-Ready" as it can be! Internet of @ThingsExpo, taking place June 6-8, 2017, at the Javits Center in New York City, New York, is co-located with 20th Cloud Expo and will feature technical sessions from a rock star conference faculty and the leading industry p...
NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Val...
NHK, Japan Broadcasting, will feature the upcoming @ThingsExpo Silicon Valley in a special 'Internet of Things' and smart technology documentary that will be filmed on the expo floor between November 3 to 5, 2015, in Santa Clara. NHK is the sole public TV network in Japan equivalent to the BBC in the UK and the largest in Asia with many award-winning science and technology programs. Japanese TV is producing a documentary about IoT and Smart technology and will be covering @ThingsExpo Silicon Val...
20th Cloud Expo, taking place June 6-8, 2017, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy.
SYS-CON Events announced today that SoftLayer, an IBM Company, has been named “Gold Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York, New York. SoftLayer, an IBM Company, provides cloud infrastructure as a service from a growing number of data centers and network points of presence around the world. SoftLayer’s customers range from Web startups to global enterprises.
With major technology companies and startups seriously embracing Cloud strategies, now is the perfect time to attend @CloudExpo | @ThingsExpo, June 6-8, 2017, at the Javits Center in New York City, NY and October 31 - November 2, 2017, Santa Clara Convention Center, CA. Learn what is going on, contribute to the discussions, and ensure that your enterprise is on the right path to Digital Transformation.
Most technology leaders, contemporary and from the hardware era, are reshaping their businesses to do software in the hope of capturing value in IoT. Although IoT is relatively new in the market, it has already gone through many promotional terms such as IoE, IoX, SDX, Edge/Fog, Mist Compute, etc. Ultimately, irrespective of the name, it is about deriving value from independent software assets participating in an ecosystem as one comprehensive solution.
SYS-CON Events announced today that Hitachi Data Systems, a wholly owned subsidiary of Hitachi LTD., will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City. Hitachi Data Systems (HDS) will be featuring the Hitachi Content Platform (HCP) portfolio. This is the industry’s only offering that allows organizations to bring together object storage, file sync and share, cloud storage gateways, and sophisticated search an...
Blockchain is a shared, secure record of exchange that establishes trust, accountability and transparency across supply chain networks. Supported by the Linux Foundation's open source, open-standards based Hyperledger Project, Blockchain has the potential to improve regulatory compliance, reduce cost and time for product recall as well as advance trade. Are you curious about Blockchain and how it can provide you with new opportunities for innovation and growth? In her session at 20th Cloud Exp...
SYS-CON Events announced today that Hitachi, the leading provider the Internet of Things and Digital Transformation, will exhibit at SYS-CON's 20th International Cloud Expo®, which will take place on June 6-8, 2017, at the Javits Center in New York City, NY. Hitachi Data Systems, a wholly owned subsidiary of Hitachi, Ltd., offers an integrated portfolio of services and solutions that enable digital transformation through enhanced data management, governance, mobility and analytics. We help globa...
Cloud promises the agility required by today’s digital businesses. As organizations adopt cloud based infrastructures and services, their IT resources become increasingly dynamic and hybrid in nature. Managing these require modern IT operations and tools. In his session at 20th Cloud Expo, Raj Sundaram, Senior Principal Product Manager at CA Technologies, will discuss how to modernize your IT operations in order to proactively manage your hybrid cloud and IT environments. He will be sharing be...
Quickly find the root cause of complex database problems slowing down your applications. Up to 88% of all application performance issues are related to the database. DPA’s unique response time analysis shows you exactly what needs fixing - in four clicks or less. Optimize performance anywhere. Database Performance Analyzer monitors on-premises, on VMware®, and in the Cloud, including Amazon® AWS and Azure™ virtual machines.
Five years ago development was seen as a dead-end career, now it’s anything but – with an explosion in mobile and IoT initiatives increasing the demand for skilled engineers. But apart from having a ready supply of great coders, what constitutes true ‘DevOps Royalty’? It’ll be the ability to craft resilient architectures, supportability, security everywhere across the software lifecycle. In his keynote at @DevOpsSummit at 20th Cloud Expo, Jeffrey Scheaffer, GM and SVP, Continuous Delivery Busine...