Welcome!

IoT User Interface Authors: Elizabeth White, Ram Sonagara, Liz McMillan, AppDynamics Blog, Yakov Fain

Related Topics: @CloudExpo, Java IoT, Microservices Expo, Containers Expo Blog, Release Management , Apache

@CloudExpo: Blog Post

Cloudera Impala – Closing the Near Real Time Gap Working with Big Data

Building data structures and loading data

By

On October 24, 2012 Cloudera announced the release of Cloudera Impala and the commercial support subscription service of Cloudera Enterprise Real Time Query (RTQ). During the Hadoop World/STRATA Conference in NYC, I was invited over to see a demonstration. Impala is a SQL based Real Time Query/Ad Hoc query engine built on top of HDFS or Hbase. As I watched the demonstration unfold, I wondered if one of the remaining technology gaps in the NOSQL arsenal had been closed.  What gap you ask? Near Real Time Analytics on a NOSQL stack. Working with customers across the Cyber Security customer space, not only do they face the familiar BIGDATA horsemen of the apocalypse: Volume, Velocity and Variety but one more large challenge crept in: Time (V3T).  The Near Real Time Analysis/Near Real Time Analytic capability that Cloudera Impala provides is essential in many high value use cases associated with Cyber Security: comparing current activity with observed historical norms, correlation of many disparate data sources/enrichment and automated threat detection algorithms.

When the demonstration concluded, the Cloudera representatives and I discussed the potential of performing an informal independent evaluation of Cloudera Impala against some of the common Real Time/Near Real Time use cases in Cyber Security. I agreed to step up and perform an independent evaluation as well as developing a demonstration platform for FedCyber 2012 (almost three weeks hence for inquiring minds).  So let us set the field: a new BETA technology, NO prior exposure to the technology or documentation, a vendor making promises, addressing a large technology gap and three weeks to implement, seemed straight forward; no pressure.

The day after I returned from the STRATA Conference, I returned to my office and provisioned four Virtual Machines in order to build the Impala demonstration. As a committer/contributor for SherpaSurfing an open source Cyber Security solution, I have an abundance of data sets, enrichment sources, Hive data structures and services.  Given the amount of time and the audience for FedCyber 2012, I decided to focus on some Intrusion Detection and Netflow related use cases for the demonstration. The data sets for the demonstration included base data sets:  20 million Netflow events, 8 million Intrusion Detection System events and enrichment: Geographic, Blacklist, Whitelist and Protocol related information. Each of the selected uses cases for this demonstration is critical to the Perform Near-Real Time Network Analysis domain in Cyber Security. The name for the demonstration system was decided to be the Impala Mission Demonstration Platform (IMDP).  The IMDP was implemented based on vendor recommendations with no tuning or optimization.

The IMDP effort provided me with my first opportunity to work with Cloudera Manager. Although this post is focused on Cloudera Impala I would be remiss not to mention Cloudera Manager. I have worked with Hadoop since 1.0 and built more than a few clusters over the years. I used the installation and configuration guides provided with Cloudera Impala and followed the recommendations. One of the first recommendations was use of the Cloudera Manager. Using the Cloudera Manager (CDH 4.1), I was able to roll out a four node cluster in two hours.  I was able to discover the hosts, manage services and provision them in accordance with the IMDP deployment plan. The deployment plan consisted of:

  • node 1 – hbase, hdfs, impala,  mapreduce
  • node2 – hbase, hdfs, impala,  mapreduce
  • node3 – hbase(region server, master), hdfs(namenode), impala(impalad, statestore),  mapreduce(job tracker, tasktracker) , hue, oozie and zookeeper
  • node4 – Application Tier, Cloudera Manager

The Cloudera Manager saved at least two days of effort in deploying the cluster, the tight integration with the support portal, comprehensive help and one place to work with all properties of the entire cluster and view space consumption metrics; verdict on Cloudera Manager: Cloudera masterful, bold stroke, thumbs up.

Now that the cluster build-out completed; I shifted attention to deploying and configuring the Cloudera Impala service.  Using Cloudera Manager, I deployed Impala on three nodes: three instances of Impalad and one impala state store, in a matter of minutes. I completed the deployment and configuration of the Hive MetaStore. Keeping in mind this is a BETA; the documentation was complete, but fragmented on deployment and configuration (HIVE MetaStore portion); verdict on impala deployment and configuration: solid for a BETA (needs an example hive-site.xml, configuration guide needs better flow).

At this point all configuration and deployment was completed, attention turned to building data structures and loading data. I took the Data Definition Language (DDL) scripts or data structures for ten data sources and enrichment; ported them over to Hive and tested them in less than four hours. It is worthy of mention that the data sources for this demonstration are large flat tables: netflow and intrusion detection system. Cloudera Impala uses HIVE as an Extract Transform Load (ETL) engine, using Hive I defined all of the data structures in source files which were sourced using hive shell: created a database (Sherpa). Hive was then used to load data into the tables that were just created. Creating data structures in Hive was simple as usual and loading data sets was quick (20 million netflow events in 57 seconds). Logging into impala-shell, issued a refresh of the MetaStore and I was working with data. I performed verification of the data load, all data loaded and no issues were revealed. One area of potential improvement would be more comprehensive messages on load failure. Defining the data structures and loading data using Hive was nothing new; verdict:  really good; easy to use, easy to load, but need to improve failed load messages.

Finally, we moved on to the most interesting stage which is using Cloudera Impala in a series of Real Time Query (RTQ) scenarios that are common across the Cyber Security customer space. The real world scenarios selected come from the perform netflow analysis set of use case(s). In each of these scenarios, the exact same queries were executed on the same cluster using Hive and then Impala against the same data structures (database and tables).  In the Hive approach, we traverse the batch processing stack and with Impala we traverse the Real Time Query (RTQ) stack performing a series of analytics. In the first use case, I ran a five tuple (sip, sport, dip, dport, protocol) summary covering bytes per packet, summing bytes and packets for a 20 million event set resulted in: identical result sets, Hive 82 seconds – Impala 6 seconds.   In the second use case, I performed a summary of destination ports where the source port is 80 which resulted in: identical result sets, Hive 57 seconds, Impala 5 seconds. In the third use case, I performed correlation between netflow and intrusion detection systems, correlating netflow with intrusion detection events for several hours which resulted in: identical result sets, Hive 40 seconds, Impala sub-second.  Finally, for FedCyber 2012, I developed a java based situational awareness dashboard which connected to Cloudera Impala via ODBC and executed analytics performing: correlation of blacklists, Intrusion Detection, Netflow, statistical cubes for ten hours with a refresh of every five seconds without failure or issue.  The ODBC implementation easily provided the ability to export data to desktop tools (using ODBC) and common BI tools as advertised. Developing and Using Cloudera Impala verdict: This is as advertised; easy to use, easy to implement on, very fast, very flexible and more than capable of running real time analytics. The Impala shell is limited but much of the demonstration work was done using result sets so it was not an impediment.

In summation, I have worked for over a decade across the vast BIGDATA technology space covering Legacy Relational Database, Data Warehouse, and NOSQL; Cloudera Impala proved more than capable of running near real time analytics and providing mission relevance to customers with a Near Real Time (NRT) requirement.  Based on my initial review Cloudera Impala appears to be a bold step in closing the gap of near real time analytics on a NOSQL stack. I did encounter some minor problems, but the few problems and limitations that were encountered in this demonstration were documented and published in the known issues document so they will not be shared; none were show stoppers.

The notes, details and all of the lessons learned, data structures and the configuration guide from the demonstration are being published out on Github under SherpaSurfing in the coming days. These documents cover everything in detail and will enable developers to replicate the demonstration platform and get a jump start on Cloudera Impala.  Finally, I would like to thank two contributors: Hanh Le, Robert Webb and Six3 Systems for helping me pull this off.

Read the original blog entry...

More Stories By Bob Gourley

Bob Gourley writes on enterprise IT. He is a founder and partner at Cognitio Corp and publsher of CTOvision.com

@CloudExpo Stories
As organizations shift towards IT-as-a-service models, the need for managing and protecting data residing across physical, virtual, and now cloud environments grows with it. Commvault can ensure protection, access and E-Discovery of your data – whether in a private cloud, a Service Provider delivered public cloud, or a hybrid cloud environment – across the heterogeneous enterprise. In his general session at 18th Cloud Expo, Randy De Meno, Chief Technologist - Windows Products and Microsoft Part...
How will your company move to the cloud while ensuring a solid security posture? Organizations from small to large are increasingly adopting cloud solutions to deliver essential business services at a much lower cost. According to cyber security experts, the frequency and severity of cyber-attacks are on the rise, causing alarm to businesses and customers across a variety of industries. To defend against exploits like these, a company must adopt a comprehensive security defense strategy that is ...
The IoTs will challenge the status quo of how IT and development organizations operate. Or will it? Certainly the fog layer of IoT requires special insights about data ontology, security and transactional integrity. But the developmental challenges are the same: People, Process and Platform. In his session at @ThingsExpo, Craig Sproule, CEO of Metavine, will demonstrate how to move beyond today's coding paradigm and share the must-have mindsets for removing complexity from the development proc...
In today's enterprise, digital transformation represents organizational change even more so than technology change, as customer preferences and behavior drive end-to-end transformation across lines of business as well as IT. To capitalize on the ubiquitous disruption driving this transformation, companies must be able to innovate at an increasingly rapid pace. Traditional approaches for driving innovation are now woefully inadequate for keeping up with the breadth of disruption and change facin...
SYS-CON Events announced today that EastBanc Technologies will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. EastBanc Technologies has been working at the frontier of technology since 1999. Today, the firm provides full-lifecycle software development delivering flexible technology solutions that seamlessly integrate with existing systems – whether on premise or cloud. EastBanc Technologies partners with p...
SYS-CON Events announced today that MangoApps will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. MangoApps provides modern company intranets and team collaboration software, allowing workers to stay connected and productive from anywhere in the world and from any device. For more information, please visit https://www.mangoapps.com/.
The cloud era has reached the stage where it is no longer a question of whether a company should migrate, but when. Enterprises have embraced the outsourcing of where their various applications are stored and who manages them, saving significant investment along the way. Plus, the cloud has become a defining competitive edge. Companies that fail to successfully adapt risk failure. The media, of course, continues to extol the virtues of the cloud, including how easy it is to get there. Migrating...
SYS-CON Events announced today that Catchpoint Systems, Inc., a provider of innovative web and infrastructure monitoring solutions, has been named “Silver Sponsor” of SYS-CON's DevOps Summit at 18th Cloud Expo New York, which will take place June 7-9, 2016, at the Javits Center in New York City, NY. Catchpoint is a leading Digital Performance Analytics company that provides unparalleled insight into customer-critical services to help consistently deliver an amazing customer experience. Designed...
SYS-CON Events announced today that Super Micro Computer, Inc., a global leader in Embedded and IoT solutions, will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Supermicro (NASDAQ: SMCI), the leading innovator in high-performance, high-efficiency server technology, is a premier provider of advanced server Building Block Solutions® for Data Center, Cloud Computing, Enterprise IT, Hadoop/Big Data, HPC and ...
18th Cloud Expo, taking place June 7-9, 2016, at the Javits Center in New York City, NY, will feature technical sessions from a rock star conference faculty and the leading industry players in the world. Cloud computing is now being embraced by a majority of enterprises of all sizes. Yesterday's debate about public vs. private has transformed into the reality of hybrid cloud: a recent survey shows that 74% of enterprises have a hybrid cloud strategy. Meanwhile, 94% of enterprises are using some...
@DevOpsSummit taking place June 7-9, 2016 at Javits Center, New York City, and Nov 1-3, 2016, at the Santa Clara Convention Center in Santa Clara, CA, is co-located with the 18th International @CloudExpo and will feature technical sessions from a rock star conference faculty and the leading industry players in the world.
SYS-CON Events announced today Object Management Group® has been named “Media Sponsor” of SYS-CON's 18th International Cloud Expo, which will take place on June 7–9, 2016, at the Javits Center in New York City, NY, and the 19th International Cloud Expo, which will take place on November 1–3, 2016, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that IBM Cloud Data Services has been named “Bronze Sponsor” of SYS-CON's 18th Cloud Expo, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. IBM Cloud Data Services offers a portfolio of integrated, best-of-breed cloud data services for developers focused on mobile computing and analytics use cases.
Cloud computing delivers on-demand resources that provide businesses with flexibility and cost-savings. The challenge in moving workloads to the cloud has been the cost and complexity of ensuring the initial and ongoing security and regulatory (PCI, HIPAA, FFIEC) compliance across private and public clouds. Manual security compliance is slow, prone to human error, and represents over 50% of the cost of managing cloud applications. Determining how to automate cloud security compliance is critical...
The Internet of Things (IoT) is growing rapidly by extending current technologies, products and networks. By 2020, Cisco estimates there will be 50 billion connected devices. Gartner has forecast revenues of over $300 billion, just to IoT suppliers. Now is the time to figure out how you’ll make money – not just create innovative products. With hundreds of new products and companies jumping into the IoT fray every month, there’s no shortage of innovation. Despite this, McKinsey/VisionMobile data...
We’ve been doing it for years, decades for some. How many websites have you created accounts on? Your bank, your credit card companies, social media sites, hotels and travel sites, online shopping sites, and that’s just the start. We do it often without even thinking about it, quickly entering our personal information, our data, in a plethora of systems. Sometimes we’re not even aware of the information we are providing. It could be very personal information (think of the security questions you ...
As machines are increasingly connected to the internet, it’s becoming easier to discover the numerous ways Industrial IoT (IIoT) is helping to shape the business world. This is exactly why we have decided to take a closer look at this pervasive movement and to examine the desire to connect more things! Now if you need a refresher on IIoT and how it is changing the world, take a moment and listen to Greg Gorbach with ARC Advisory Group. Gorbach believes, "IIoT will significantly change the worl...
"What we see what happens when you have a completely networked society and the potential to now drive the value creation and the collaboration and the ecosystems that are possible when you start to be able to connect people and industries together in ways that have never been possible before," explained Esmeralda Swartz, VP of Marketing Enterprise & Cloud at Ericsson, in this SYS-CON.tv interview at @ThingsExpo, held November 3-5, 2015, at the Santa Clara Convention Center in Santa Clara, CA.
SYS-CON Events announced today that AppNeta, the leader in performance insight for business-critical web applications, will exhibit and present at SYS-CON's @DevOpsSummit at Cloud Expo New York, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. AppNeta is the only application performance monitoring (APM) company to provide solutions for all applications – applications you develop internally, business-critical SaaS applications you use and the networks that deli...
SYS-CON Events announced today that Zerto will exhibit at SYS-CON's 18th International Cloud Expo®, which will take place on June 7-9, 2016, at the Javits Center in New York City, NY. Zerto is committed to keeping enterprise and cloud IT running 24/7 by providing innovative, simple, reliable and scalable business continuity software solutions. Through the Zerto Cloud Continuity Platform™, organizations can seamlessly move and protect virtualized workloads between public, private and hybrid clou...