Natural Language Processing: Bridging the Gap Between Big Data and Big Information

“Apophenia is the propensity to see patterns in random data.” We encounter it all the time in the real world. Examples include gamblers who see patterns in how the cards are being dealt, investors who imagine patterns in the movement of certain stocks, and basketball fans who believe that their favorite player has the “hot hand.” But apophenia has no place in the world of data science, especially when data science is trying to help us make better decisions about critical things such as the quality of healthcare, the allocation of police resources, the effective operation of our airplanes, or the investment decisions that determine our retirement readiness.

Understanding the difference between epiphany (a sudden, intuitive perception of or insight into reality) and apophenia (the perception of or belief in connectedness among unrelated phenomena) is critical as data scientists build analytic models to quantify cause and effect. Regression modeling is a great tool for quantifying cause and effect, but we still need to leverage industry insights, and sometimes just plain common sense, to make sure that we are not trying to quantify “spurious relationships.” In the world of data science, one cannot automate out the importance of common sense. I think Craig Wilkey nails this distinction in this blog post he wrote:

Apophenia is the propensity to see patterns in random data. The term was coined in 1958 by Klaus Conrad – a German neurologist and psychiatrist who, perhaps a little ironically, was attempting to identify early indicators of psychosis.

An apophany (an instance of apophenia) can best be defined in contrast to an epiphany. An epiphany is a moment of sudden and striking realization that leads a person to a greater degree of clarity in the nature of reality – a discovery of a truism, often hidden in plain sight. An apophany is having the experience of an epiphany, but you’re just plain wrong.

We’ve all heard some version of the old adage that correlation does not imply causation. It can be clearly demonstrated that in neighborhoods where there is an increase in ice cream consumption, there is a roughly equivalent spike in aggravated assault incidents. We’d be foolish to assume that eating ice cream makes people irrationally violent, but that doesn’t mean there’s nothing valuable to learn from this. When we broaden the lens a bit and include other variables, the connections become clearer.

In overpopulated urban environments, where there is a greater concentration of disenfranchised people – people who are statistically more likely to commit poverty crimes, and statistically less likely to have air-conditioned homes – heat waves usher in higher levels of frustration, lower levels of tolerance, and more people eating ice cream. We can also sharpen the focus by throwing in aggravation over public transit failures, brown-outs and black-outs, lower productivity, and countless other factors.

So, while enjoying tasty dairy products does not necessarily incite violence, the correlation between ice cream consumption and violence is not meaningless. Ice cream consumption analysis may indeed provide value as a leading indicator, or bellwether, of the potential for violent acts trending upward in a given community. If not a bellwether, it certainly is a valid correlation – as opposed to a simple coincidence.
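A small simulation can show how a shared driver produces a correlation like this one. The sketch below is illustrative only: the data is synthetic, and the variable names and coefficients (temperature driving both ice cream sales and assault counts) are assumptions made up for the example, not figures from any real study. It compares the raw correlation with the correlation that remains once the temperature effect has been regressed out of both series.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def residuals(ys, xs):
    """What is left of ys after a simple linear regression on xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    intercept = mean_y - slope * mean_x
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


rng = random.Random(42)  # fixed seed so the run is repeatable

# Hypothetical daily observations: temperature is the hidden common driver.
temperature = [rng.uniform(10, 40) for _ in range(2000)]
ice_cream = [2.0 * t + rng.gauss(0, 8) for t in temperature]  # assumed coefficients
assaults = [0.5 * t + rng.gauss(0, 3) for t in temperature]   # assumed coefficients

# Correlation between the two series, ignoring temperature.
raw_r = pearson(ice_cream, assaults)

# Correlation between what remains after removing the temperature effect.
partial_r = pearson(residuals(ice_cream, temperature),
                    residuals(assaults, temperature))

print(f"raw correlation:            {raw_r:.2f}")
print(f"after removing temperature: {partial_r:.2f}")
```

On this synthetic data the raw correlation is strong, while the correlation left after accounting for temperature is close to zero. That is consistent with ice cream consumption being a usable indicator without being a cause.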

The purpose of regression analysis is to identify those variables (referred to as independent variables) that help reveal valid correlations in the phenomena one is attempting to predict (the dependent variable).

Regression analysis is a tricky beast to harness. When the whole point is to find hidden correlations that may even defy intuitive understanding, it can be tempting to throw in the entire kitchen sink and see what comes out. The greatest perceived risk in that arises from patterns that may align, but are nevertheless invalid. These coincidences are referred to as ‘spurious relationships’.

If the patterns of some spurious relationship(s) happen to align with the patterns of other independent variables in a regression analysis model, the accuracy of the model will suffer, possibly dramatically.
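The kitchen-sink risk is easy to reproduce with pure noise. The sketch below is a toy under stated assumptions: the sample sizes (10 observations, 1,000 candidate variables) and the seed are arbitrary choices for illustration. Every series is random and unrelated to the target, yet the best-looking candidate still shows a strong correlation on the small sample, and that correlation does not hold up on fresh data.

```python
import random


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


rng = random.Random(7)  # fixed seed so the run is repeatable

N_OBS = 10           # e.g. ten yearly observations (assumed)
N_CANDIDATES = 1000  # the "kitchen sink" of unrelated variables (assumed)

# The target and every candidate are independent noise by construction.
target = [rng.gauss(0, 1) for _ in range(N_OBS)]
candidates = [[rng.gauss(0, 1) for _ in range(N_OBS)]
              for _ in range(N_CANDIDATES)]

# Pick the candidate that happens to line up best with the target.
correlations = [pearson(c, target) for c in candidates]
best_index = max(range(N_CANDIDATES), key=lambda i: abs(correlations[i]))
best_r = correlations[best_index]

# Observe the "winning" variable and the target over a new, longer period.
# Both are still pure noise, so the apparent relationship should vanish.
fresh_target = [rng.gauss(0, 1) for _ in range(1000)]
fresh_candidate = [rng.gauss(0, 1) for _ in range(1000)]
fresh_r = pearson(fresh_candidate, fresh_target)

print(f"best in-sample correlation: {best_r:.2f}")
print(f"same variable, fresh data:  {fresh_r:.2f}")
```

Checking a candidate variable against data it was not selected on is one simple guard. It lets a model take in many inputs while still filtering out the coincidences.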

It would be foolish to place any faith in all those quirky coincidences we always hear about with sports teams, for example. There is no reasonably conceivable way the first initial of the middle name of the first child born in some small town after the start of a sports season could predict a team’s playoff standings – but I’d be genuinely surprised if there weren’t some spurious relationship to be found there.

On the other hand, we do have a valid argument for getting rid of the dramatic orchestra strike that foreshadows violent crime in movies, and replacing it with the sound of an ice cream truck.

How do we strike the balance between the desire to uncover hidden variables that provide valuable insight into trends, and the fear of creating an apophenic, potentially psychotic, regression analysis model?

My nearly two and a half decades of experience in IT have led me to the conclusion that the field suffers from rampant apopheniphobia: The irrational fear of finding ostensibly meaningful patterns in random data. (Yes, I did just make that word up. © Craig Wilkey, 2016)

Almost invariably, we simply do not push far enough. Should stock market analysis include things like weather patterns, celebrity news stories and grade school holidays? Absolutely! Classical stock market analysis techniques don’t work as well as they used to. Why? Frankly, we have a greater number of ignorant people playing the market. The proliferation of ‘Day Traders’ has crippled the old market truisms, because so many people who are affecting the market dynamics don’t have any classical training. The things that affect the moods and daily lives of ‘normal people’ need to be considered, because ‘normal people’ are far more active in the markets than they used to be. If they don’t play by the rules, then some of those rules simply cease to apply.

Apopheniphobia is fueled by fears of falling prey to spurious relationships. Who wants to be known as the person who unleashed a dangerously psychotic algorithm into the world? People think about the many statistical oddities they’ve come across, and it stunts their creative growth. For example, did you know there is a direct correlation between the per capita consumption of margarine and the divorce rate in Maine? Cheese consumption is far more dangerous than margarine consumption – it correlates with the number of people who die by becoming tangled in their bed sheets. (And you thought lactose intolerance was bad?) The number of people who drowned by falling into a pool also correlates with the number of films Nicolas Cage appeared in from 1999 through 2009. (Source)

In IT, we have a tendency to drive toward ‘proving’ clear, unambiguous relationships that quantify efforts, justify means and, more often than not, clearly align to our own preconceived notions. We want to be able to show clear lines of progression and indisputably direct relationships – we tend to believe anything less will not be trusted by those who hold the purse strings.

Our hyper-rational modes of thinking have a tendency to overshadow our creative imaginations – which, almost inevitably, leads to hampered understanding. Perhaps the greatest value of regression analysis is that it allows us to challenge our preconceived notions and learn something new. The greatest challenge with it is rarely throwing too much data at our models – it’s not having enough.

Yes, I know… We’re IT. We’re awash with data. We’re swimming in lakes of data and constantly inhaling the fumes of endless data exhaust. What we’re missing is the meaningful data extracted from unstructured information sources – in other words, the extraordinarily valuable information that’s locked away in language that has historically been inaccessible to machines – human language.

Estimates have been telling us for a decade or more that 80% of all information in a given organization is in the form of unstructured, human-readable text. Nowhere does that ring more true, or matter more, than in trying to understand customer experience. I’d also argue that the majority of the most important service information is within that 80%.

Customer Experience Personalization absolutely depends on translating that human-readable text to machine-actionable data. When it comes to understanding and deriving value from actionable insights within our customer interactions, we must extract as much understanding from that unstructured text as possible and add it all to the other data in our regression models. Apopheniphobia be damned!

While it’s, admittedly, an oversimplification, it’s convenient to talk about two general approaches to extracting data from text. Text Analytics/Mining breaks the textual input into digestible chunks of string variables and uses statistical modeling techniques to find patterns in those variables. The ideal of Natural Language Processing is to develop a translation engine between human language and machine language. It uses some of the same statistical modeling approaches as Text Analytics, but goes much further by applying semantic and syntactic analysis to extract meaning, intention, sentiment and key concepts (among other things) included in the text.
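To make the text analytics end of that spectrum concrete, here is a minimal sketch, assuming a handful of made-up support tickets and a hand-picked vocabulary (both hypothetical). It breaks each text into word tokens and counts them, producing numeric columns that could sit beside structured data in a regression model. It deliberately does no semantic or syntactic analysis, which is the part Natural Language Processing adds.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def to_features(text, vocabulary):
    """Count how often each vocabulary term appears in the text."""
    counts = Counter(tokenize(text))
    return [counts[term] for term in vocabulary]


# Hypothetical customer messages, invented for this example.
tickets = [
    "The router keeps dropping the connection. Very frustrated!",
    "Thanks, the replacement router works great.",
    "Still waiting on a refund. Frustrated and waiting.",
]

# Hypothetical vocabulary; in practice it would be derived from the corpus.
vocabulary = ["router", "frustrated", "refund", "waiting", "great"]

# One numeric row per ticket, one column per vocabulary term.
feature_rows = [to_features(ticket, vocabulary) for ticket in tickets]

for ticket, row in zip(tickets, feature_rows):
    print(row, "<-", ticket)
```

Counts like these capture which words appear, not what the customer meant. “Not great” and “great” look nearly identical to this approach, which is where semantic analysis earns its keep.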

Our best opportunity to achieve our vision of industry-leading Customer Experience Personalization is to take advantage of Natural Language Processing. This barely scratches the surface of what’s possible. Natural Language Processing will enable us to step aggressively toward extracting real meaning from the vast amount of otherwise machine-invisible, extraordinarily valuable content we have. Using that extracted meaning, in conjunction with our structured data points, will allow us to build truly valuable regression analysis models to understand our customers like never before. Keep pushing until the model breaks, then dial it back a skosh. That is the path to progress.

Apopheniphobia is the enemy of personalization and Customer Relationship Management. This is why I’ve decided to launch the Apopheniphobia Awareness Campaign. Please spread the word! I need to come up with a design for the lapel pin… Maybe a ribbon with as many digits of pi as I can squeeze on it – with all the prime digits bolded? Maybe we should schedule a charity walk… We can follow the streets in alphabetical order.

The post Natural Language Processing: Bridging the Gap Between Big Data and Big Information appeared first on InFocus Blog | Dell EMC Services.


More Stories By William Schmarzo

Bill Schmarzo, author of “Big Data: Understanding How Data Powers Big Business”, is responsible for setting the strategy and defining the Big Data service line offerings and capabilities for the EMC Global Services organization. As part of Bill’s CTO charter, he is responsible for working with organizations to help them identify where and how to start their big data journeys. He has written several white papers, is an avid blogger, and is a frequent speaker on the use of Big Data and advanced analytics to power organizations’ key business initiatives. He also teaches the “Big Data MBA” at the University of San Francisco School of Management.

Bill has nearly three decades of experience in data warehousing, BI and analytics. Bill authored EMC’s Vision Workshop methodology that links an organization’s strategic business initiatives with their supporting data and analytic requirements, and co-authored with Ralph Kimball a series of articles on analytic applications. Bill has served on The Data Warehouse Institute’s faculty as the head of the analytic applications curriculum.

Previously, Bill was the Vice President of Advertiser Analytics at Yahoo and the Vice President of Analytic Applications at Business Objects.
