
Machine Learning Authors: William Schmarzo, Ed Featherston, Liz McMillan, Elizabeth White, Dan Blacharski



Natural Language Processing: Bridging the Gap Between Big Data and Big Information

“Apophenia is the propensity to see patterns in random data.” We encounter it all the time in the real world. Examples include gamblers who see patterns in how the cards are being dealt, investors who imagine patterns in the movement of certain stocks, and basketball fans who believe that their favorite player has the “hot hand.” But apophenia has no place in the world of data science, especially when data science is trying to help us make better decisions about critical things such as improving the quality of healthcare, allocating police resources, ensuring that our airplanes operate effectively or making investment decisions that determine our retirement readiness.

Understanding the difference between epiphany (a sudden, intuitive perception of or insight into reality) and apophenia (the perception of or belief in connectedness among unrelated phenomena) is critical as data scientists build analytic models to quantify cause and effect. Regression modeling is a great tool for quantifying cause and effect, but we still need to leverage industry insights and sometimes just plain common sense to make sure that we are not trying to quantify “spurious relationships.” In the world of data science, one cannot automate out the importance of common sense. I think Craig Wilkey nails this distinction in this blog post he wrote:

Apophenia is the propensity to see patterns in random data. The term was coined in 1958 by Klaus Conrad – a German neurologist and psychiatrist who, perhaps a little ironically, was attempting to identify early indicators of psychosis.

An apophany (an instance of apophenia) can best be defined in contrast to an epiphany. An epiphany is a moment of sudden and striking realization that leads a person to a greater degree of clarity in the nature of reality – a discovery of a truism, often hidden in plain sight. An apophany is having the experience of an epiphany, but you’re just plain wrong.

We’ve all heard some version of the old adage that correlation does not imply causation. It can be clearly demonstrated that in neighborhoods where there is an increase in ice cream consumption, there is a roughly equivalent spike in aggravated assault incidents. We’d be foolish to assume that eating ice cream makes people irrationally violent, but that doesn’t mean there’s nothing valuable to learn from this. When we broaden the lens a bit and include other variables, the connections become clearer.

In overpopulated urban environments, where there is a greater concentration of disenfranchised people – people who are statistically more likely to commit poverty crimes, and statistically less likely to have air-conditioned homes – heat waves usher in higher levels of frustration, lower levels of tolerance, and more people eating ice cream. We can also sharpen the focus by throwing in aggravation over public transit failures, brown-outs and black-outs, lower productivity, and countless other factors.

So, while enjoying tasty dairy products does not necessarily incite violence, the correlation between ice cream consumption and violence is not meaningless. Ice cream consumption analysis may indeed provide value as a leading indicator, or bellwether, of the potential for violent acts trending upward in a given community. If not a bellwether, it certainly is a valid correlation – as opposed to a simple coincidence.
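The ice cream argument above can be made concrete with a small simulation. This is only a sketch with invented numbers: the series `temperature`, `ice_cream` and `assaults`, along with every coefficient, are hypothetical and not drawn from any real data set. Temperature drives both other series, and neither one influences the other.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

def residuals(ys, xs):
    """What is left of ys after a simple least-squares fit on xs."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return [y - (my + slope * (x - mx)) for x, y in zip(xs, ys)]

rng = random.Random(0)

# Hypothetical daily observations: heat is the hidden common driver.
temperature = [rng.uniform(10, 40) for _ in range(2000)]
ice_cream = [3.0 * t + rng.gauss(0, 10) for t in temperature]
assaults = [0.5 * t + rng.gauss(0, 3) for t in temperature]

# Correlation when temperature is left out of the picture.
raw_r = pearson(ice_cream, assaults)

# Correlation once temperature has been accounted for in both series.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(assaults, temperature))

print(f"raw correlation: {raw_r:.2f}")
print(f"after controlling for temperature: {partial_r:.2f}")
```

Under these assumptions the raw correlation should come out strongly positive, while the correlation that remains after controlling for temperature should sit close to zero. That matches the point in the text: the ice cream signal is real and potentially useful as an indicator, but the explanatory variable is the heat.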

The purpose of regression analysis is to identify those variables (referred to as independent variables) that help reveal valid correlations in the phenomena one is attempting to predict (the dependent variable).

Regression analysis is a tricky beast to harness. When the whole point is to find hidden correlations that may even defy intuitive understanding, it can be tempting to throw in the entire kitchen sink and see what comes out. The greatest perceived risk in doing so comes from patterns that happen to align but are nevertheless invalid. These coincidences are referred to as ‘spurious relationships’.

If the patterns of some spurious relationship(s) happen to align with the patterns of other independent variables in a regression analysis model, the accuracy of the model will suffer, possibly dramatically.
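To see how easily the kitchen sink produces a spurious relationship, here is a minimal sketch in which everything is random noise. The sample sizes and the number of candidate predictors are arbitrary choices for illustration, not figures from the original post.

```python
import random

def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

rng = random.Random(42)
n_train, n_holdout, n_predictors = 20, 20, 200
n_total = n_train + n_holdout

# Every series is pure noise, so any relationship found is spurious by construction.
target = [rng.gauss(0, 1) for _ in range(n_total)]
predictors = [[rng.gauss(0, 1) for _ in range(n_total)]
              for _ in range(n_predictors)]

# Screen all candidates against the first 20 observations only.
train_r = [pearson(p[:n_train], target[:n_train]) for p in predictors]
best = max(range(n_predictors), key=lambda i: abs(train_r[i]))

# Re-check the "winner" on the 20 observations it has never seen.
holdout_r = pearson(predictors[best][n_train:], target[n_train:])

print(f"best of {n_predictors} noise predictors, training r = {train_r[best]:.2f}")
print(f"same predictor on held-out data, r = {holdout_r:.2f}")
```

With only 20 observations and 200 candidates, the best training correlation is usually large enough to look meaningful, even though nothing here is related to anything else. Checking the chosen variable against data it was not selected on is one simple guard; the held-out figure is typically much weaker, though with samples this small it can still wander.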

It would be foolish to place any faith in all those quirky coincidences we always hear about with sports teams, for example. There is no reasonably conceivable way the first initial of the middle name of the first child born in some small town after the start of a sports season could predict a team’s playoff standings – but I’d be genuinely surprised if there weren’t some spurious relationship to be found there.

On the other hand, we do have a valid argument for getting rid of the dramatic orchestra strike that foreshadows violent crime in movies, and replacing it with the sound of an ice cream truck.

How do we strike the balance between the desire to uncover hidden variables that provide valuable insight into trends, and the fear of creating an apophenic, potentially psychotic, regression analysis model?

My nearly two and a half decades of experience in IT have led me to the conclusion that the field suffers from rampant apopheniphobia: The irrational fear of finding ostensibly meaningful patterns in random data. (Yes, I did just make that word up. © Craig Wilkey, 2016)

Almost invariably, we simply do not push far enough. Should stock market analysis include things like weather patterns, celebrity news stories and grade school holidays? Absolutely! Classical stock market analysis techniques don’t work as well as they used to. Why? Frankly, we have a greater number of ignorant people playing the market. The proliferation of ‘Day Traders’ has crippled the old market truisms, because so many people who are affecting the market dynamics don’t have any classical training. The things that affect the moods and daily lives of ‘normal people’ need to be considered, because ‘normal people’ are far more active in the markets than they used to be. If they don’t play by the rules, then some of those rules simply cease to apply.

Apopheniphobia is fueled by fears of falling prey to spurious relationships. Who wants to be known as the person who unleashed a dangerously psychotic algorithm into the world? People think about the many statistical oddities they’ve come across, and it stunts their creative growth. For example, did you know there is a direct correlation between the per capita consumption of margarine and the divorce rate in Maine? Cheese consumption is far more dangerous than margarine consumption – it correlates with the number of people who die by becoming tangled in their bed sheets. (And you thought lactose intolerance was bad?) The number of people who drowned by falling into a pool also correlates with the number of films Nicolas Cage appeared in from 1999 through 2009. (Source)

In IT, we have a tendency to drive toward ‘proving’ clear, unambiguous relationships that quantify efforts, justify means and, more often than not, clearly align to our own preconceived notions. We want to be able to show clear lines of progression and indisputably direct relationships – we tend to believe anything less will not be trusted by those who hold the purse strings.

Our hyper-rational modes of thinking have a tendency to overshadow our creative imaginations – which, almost inevitably, leads to hampered understanding. Perhaps the greatest value of regression analysis is that it allows us to challenge our preconceived notions and learn something new. The greatest challenge with it is rarely throwing too much data at our models – it’s not having enough.

Yes, I know… We’re IT. We’re awash with data. We’re swimming in lakes of data and constantly inhaling the fumes of endless data exhaust. What we’re missing is the meaningful data extracted from unstructured information sources – in other words, the extraordinarily valuable information that’s locked away in language that has historically been inaccessible to machines – human language.

Estimates have been telling us for a decade or more that 80% of all information in a given organization is in the form of unstructured, human-readable text. Nowhere does that ring truer, or matter more, than in trying to understand customer experience. I’d also argue that the majority of the most important service information is within that 80%.

Customer Experience Personalization absolutely depends on translating that human-readable text to machine-actionable data. When it comes to understanding and deriving value from actionable insights within our customer interactions, we must extract as much understanding from that unstructured text as possible and add it all to the other data in our regression models. Apopheniphobia be damned!

While it’s, admittedly, an oversimplification, it’s convenient to talk about two general approaches to extracting data from text. Text Analytics/Mining breaks the textual input into digestible chunks of string variables and uses statistical modeling techniques to find patterns in those variables. The ideal of Natural Language Processing is to develop a translation engine between human language and machine language. It uses some of the same statistical modeling approaches as Text Analytics, but goes much further by applying semantic and syntactic analysis to extract meaning, intention, sentiment and key concepts (among other things) included in the text.
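The Text Analytics side of that distinction can be sketched in a few lines. The customer comments and the tracked vocabulary below are invented for illustration, and the tokenizer is deliberately crude: it counts words and understands nothing.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

# Hypothetical customer comments, made up for this example.
comments = [
    "The install failed again and support was slow.",
    "Great support, the install was quick!",
    "Slow, slow, slow. Still waiting on a fix.",
]

# Hypothetical list of terms we choose to track.
vocabulary = ["install", "support", "slow", "failed", "great"]

def to_features(text):
    """Turn one comment into a row of term counts, one column per vocabulary word."""
    counts = Counter(tokenize(text))
    return [counts[word] for word in vocabulary]

feature_rows = [to_features(comment) for comment in comments]

for comment, row in zip(comments, feature_rows):
    print(row, "<-", comment)
```

Each comment becomes a row of numbers that a statistical model can consume. What this approach cannot do is tell “slow” apart from “not slow at all”, and that gap is where the syntactic and semantic analysis of Natural Language Processing comes in.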

Our best opportunity to achieve our vision of industry-leading Customer Experience Personalization is to take advantage of Natural Language Processing. This barely scratches the surface of what’s possible. Natural Language Processing will enable us to step aggressively toward extracting real meaning from the vast amount of otherwise machine-invisible, extraordinarily valuable content we have. Using that extracted meaning, in conjunction with our structured data points, will allow us to build truly valuable regression analysis models to understand our customers like never before. Keep pushing until the model breaks, then dial it back a skosh. That is the path to progress.
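As a rough illustration of combining a text-derived variable with structured data, the sketch below fits an ordinary least-squares model twice: once on a structured field alone and once with a text-derived count added. Everything here is hypothetical, including the names `wait_days`, `complaint_terms` and `satisfaction` and the coefficients used to simulate them. The small solver is written out by hand only to keep the example self-contained.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_ols(rows, ys):
    """Least-squares coefficients (intercept first) via the normal equations."""
    xs = [[1.0] + list(row) for row in rows]
    k = len(xs[0])
    xtx = [[sum(x[i] * x[j] for x in xs) for j in range(k)] for i in range(k)]
    xty = [sum(x[i] * y for x, y in zip(xs, ys)) for i in range(k)]
    return solve(xtx, xty)

def r_squared(rows, ys, coefs):
    """Share of the variance in ys explained by the fitted model."""
    mean_y = sum(ys) / len(ys)
    preds = [coefs[0] + sum(c * v for c, v in zip(coefs[1:], row)) for row in rows]
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot

rng = random.Random(1)

# Hypothetical data: wait_days is a structured field, complaint_terms stands in
# for a count extracted from free-text customer comments.
wait_days = [rng.randint(1, 10) for _ in range(500)]
complaint_terms = [rng.randint(0, 4) for _ in range(500)]
satisfaction = [9.0 - 0.4 * w - 1.2 * c + rng.gauss(0, 0.5)
                for w, c in zip(wait_days, complaint_terms)]

structured_only = [[w] for w in wait_days]
with_text = [[w, c] for w, c in zip(wait_days, complaint_terms)]

coefs_a = fit_ols(structured_only, satisfaction)
coefs_b = fit_ols(with_text, satisfaction)
r2_a = r_squared(structured_only, satisfaction, coefs_a)
r2_b = r_squared(with_text, satisfaction, coefs_b)

print(f"structured data only: R^2 = {r2_a:.2f}")
print(f"with text-derived feature: R^2 = {r2_b:.2f}")
```

In this simulation the text-derived count carries most of the signal, so adding it explains far more of the variation. One caution applies: in-sample R² never goes down when a variable is added, useful or not, so a real model should be judged on data it was not fitted to.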

Apopheniphobia is the enemy of personalization and Customer Relationship Management. This is why I’ve decided to launch the Apopheniphobia Awareness Campaign. Please spread the word! I need to come up with a design for the lapel pin… Maybe a ribbon with as many digits of pi as I can squeeze on it – with all the prime digits bolded? Maybe we should schedule a charity walk… We can follow the streets in alphabetical order.

The post Natural Language Processing: Bridging the Gap Between Big Data and Big Information appeared first on InFocus Blog | Dell EMC Services.


More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business” and “Big Data MBA: Driving Business Strategies with Data Science”, is responsible for setting strategy and defining the Big Data service offerings for Dell EMC’s Big Data Practice.

As a CTO within Dell EMC’s 2,000+ person consulting organization, he works with organizations to identify where and how to start their big data journeys. He’s written white papers, is an avid blogger and is a frequent speaker on the use of Big Data and data science to power an organization’s key business initiatives. He is a University of San Francisco School of Management (SOM) Executive Fellow where he teaches the “Big Data MBA” course. Bill also just completed a research paper on “Determining The Economic Value of Data”. Onalytica recently ranked Bill as #4 Big Data Influencer worldwide.

Bill has over three decades of experience in data warehousing, BI and analytics. Bill authored the Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements. Bill serves on the City of San Jose’s Technology Innovation Board, and on the faculties of The Data Warehouse Institute and Strata.

Previously, Bill was vice president of Analytics at Yahoo where he was responsible for the development of Yahoo’s Advertiser and Website analytics products, including the delivery of “actionable insights” through a holistic user experience. Before that, Bill oversaw the Analytic Applications business unit at Business Objects, including the development, marketing and sales of their industry-defining analytic applications.

Bill holds a Master of Business Administration from the University of Iowa and a Bachelor of Science degree in Mathematics, Computer Science and Business Administration from Coe College.
