What is machine learning? Software derived from data

You’ve probably encountered the term “machine learning” more than a few times lately. Often used interchangeably with artificial intelligence, machine learning is in fact a subset of AI, both of which can trace their roots to MIT in the late 1950s.

Machine learning is something you probably encounter every day, whether you know it or not. The Siri and Alexa voice assistants, Facebook’s and Microsoft’s facial recognition, Amazon and Netflix recommendations, the technology that keeps self-driving cars from crashing into things – all are a result of advances in machine learning.

While still nowhere near as complex as a human brain, systems based on machine learning have achieved some impressive feats, like defeating human challengers at chess, Jeopardy, Go, and Texas Hold ‘em.

Dismissed for decades as overhyped and unrealistic (the infamous “AI winter”), both AI and machine learning have enjoyed a huge resurgence over the last few years, thanks to a number of technological breakthroughs, a massive explosion in cheap computing horsepower, and a bounty of data for machine learning models to chew on.

Self-taught software

So what is machine learning, exactly? Let’s start by noting what it is not: a conventional, hand-coded, human-programmed computing application.

Unlike traditional software, which is great at following instructions but terrible at improvising, machine learning systems essentially code themselves, developing their own instructions by generalizing from examples.

The classic example is image recognition. Show a machine learning system enough photos of dogs (labeled “dogs”), as well as pictures of cats, trees, babies, bananas, or any other object (labeled “not dogs”), and if the system is trained correctly it will eventually get good at identifying canines, without a human being ever telling it what a dog is supposed to look like.

The spam filter in your email program is a good example of machine learning in action. After being exposed to hundreds of millions of spam samples, as well as non-spam email, it has learned to identify the key characteristics of those nasty unwanted messages. It’s not perfect, but it’s usually pretty accurate.
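
To make that concrete, here is a toy spam scorer in Java. It is a minimal sketch of the general technique behind such filters (a naive Bayes word model), not the implementation any real email provider uses, and the class name, messages, and words below are all invented for illustration:

import java.util.HashMap;
import java.util.Map;

// A toy spam scorer: it "learns" word frequencies from labeled examples
// instead of following hand-written rules. Everything here is illustrative.
public class ToySpamFilter {
    private final Map<String, Integer> spamCounts = new HashMap<>();
    private final Map<String, Integer> hamCounts = new HashMap<>();
    private int spamTotal = 0, hamTotal = 0;

    // Training: count how often each word appears in spam vs. non-spam
    public void train(String message, boolean isSpam) {
        for (String word : message.toLowerCase().split("\\W+")) {
            if (isSpam) { spamCounts.merge(word, 1, Integer::sum); spamTotal++; }
            else        { hamCounts.merge(word, 1, Integer::sum);  hamTotal++; }
        }
    }

    // Scoring: sum the log-odds of each word, with add-one smoothing so
    // unseen words don't zero out the ratio
    public double spamScore(String message) {
        double logOdds = 0.0;
        for (String word : message.toLowerCase().split("\\W+")) {
            double pSpam = (spamCounts.getOrDefault(word, 0) + 1.0) / (spamTotal + 2.0);
            double pHam  = (hamCounts.getOrDefault(word, 0) + 1.0) / (hamTotal + 2.0);
            logOdds += Math.log(pSpam / pHam);
        }
        return logOdds; // positive means more spam-like
    }

    public static void main(String[] args) {
        ToySpamFilter filter = new ToySpamFilter();
        filter.train("win a free prize now", true);
        filter.train("free money click now", true);
        filter.train("meeting agenda for tomorrow", false);
        filter.train("lunch tomorrow with the team", false);
        System.out.println(filter.spamScore("free prize"));   // > 0: spam-like
        System.out.println(filter.spamScore("team meeting")); // < 0: ham-like
    }
}

Note that the filter never receives a rule like “free means spam”; it infers the association from the labeled examples alone.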

Supervised vs. unsupervised learning

This kind of machine learning is called supervised learning, which means that someone exposed the machine learning algorithm to an enormous set of training data, examined its output, then continuously tweaked its settings until it produced the expected result when shown data it had not seen before. (This is analogous to clicking the “not spam” button in your inbox when the filter traps a legitimate message by accident. The more you do that, the more the accuracy of the filter should improve.)

The most common supervised learning tasks involve classification and prediction (i.e., “regression”). Spam detection and image recognition are both classification problems. Predicting stock prices is a classic example of a regression problem.
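
For instance, the simplest regression model fits a straight line through observed (x, y) pairs and uses it to predict y for unseen x. Below is a minimal ordinary least squares sketch in Java; the data points are invented for illustration, and a real price predictor would use many more features:

// Ordinary least squares fit of y = slope * x + intercept
public class ToyRegression {
    public static void main(String[] args) {
        double[] x = {1, 2, 3, 4, 5};           // e.g., time periods
        double[] y = {2.1, 3.9, 6.2, 8.0, 9.8}; // invented observations

        double meanX = 0, meanY = 0;
        for (int i = 0; i < x.length; i++) { meanX += x[i]; meanY += y[i]; }
        meanX /= x.length;
        meanY /= y.length;

        // slope = covariance(x, y) / variance(x)
        double cov = 0, var = 0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - meanX) * (y[i] - meanY);
            var += (x[i] - meanX) * (x[i] - meanX);
        }
        double slope = cov / var;
        double intercept = meanY - slope * meanX;

        // Predict for an unseen input
        System.out.printf("Predicted y at x=6: %.2f%n", slope * 6 + intercept);
    }
}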

A second kind of machine learning is called unsupervised learning. This is where the system pores over vast amounts of data to learn what “normal” data looks like, so it can detect anomalies and hidden patterns. Unsupervised machine learning is useful when you don’t really know what you’re looking for, so you can’t train the system to find it.

Unsupervised machine learning systems can identify patterns in vast amounts of data many times faster than humans can, which is why banks use them to flag fraudulent transactions, marketers deploy them to identify customers with similar attributes, and security software employs them to detect hostile activity on a network.

Clustering and association rule learning are two examples of unsupervised learning algorithms. Clustering is the secret sauce behind customer segmentation, for example, while association rule learning is used for recommendation engines.
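
As a rough sketch of how clustering works, the toy Java program below runs one-dimensional k-means over invented customer spend figures, grouping them into two segments without any labels. Real segmentation operates on many attributes at once, but the loop is the same idea: assign each point to its nearest center, then move each center to the mean of its points.

import java.util.Arrays;

// Tiny 1-D k-means: cluster made-up customer spend figures into two groups
public class ToyKMeans {
    public static void main(String[] args) {
        double[] spend = {12, 15, 14, 90, 95, 88, 13, 92}; // invented data
        double[] centers = {spend[0], spend[3]};           // naive initialization
        int[] assign = new int[spend.length];

        for (int iter = 0; iter < 10; iter++) {
            // Assignment step: nearest center wins
            for (int i = 0; i < spend.length; i++) {
                assign[i] = Math.abs(spend[i] - centers[0])
                          <= Math.abs(spend[i] - centers[1]) ? 0 : 1;
            }
            // Update step: move each center to the mean of its members
            for (int c = 0; c < 2; c++) {
                double sum = 0; int n = 0;
                for (int i = 0; i < spend.length; i++) {
                    if (assign[i] == c) { sum += spend[i]; n++; }
                }
                if (n > 0) centers[c] = sum / n;
            }
        }
        System.out.println("Centers: " + Arrays.toString(centers));
        System.out.println("Assignments: " + Arrays.toString(assign));
    }
}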

Limitations of machine learning

Because each machine learning system creates its own connections, how a particular one actually works can be a bit of a black box. You can’t always reverse engineer the process to discover why your system can distinguish between a Pekingese and a Persian. As long as it works, it doesn’t really matter.

But a machine learning system is only as good as the data it has been exposed to – the classic example of “garbage in, garbage out.” When poorly trained or exposed to an insufficient data set, a machine learning algorithm can produce results that are not only wrong but discriminatory.

HP got into trouble back in 2009 when facial recognition technology built into the webcam on an HP MediaSmart laptop was unable to detect the faces of African Americans. In June 2015, faulty algorithms in the Google Photos app mislabeled two black Americans as gorillas.

Another dramatic example: Microsoft’s ill-fated Taybot, a March 2016 experiment to see if an AI system could emulate human conversation by learning from tweets. In less than a day, malicious Twitter trolls had turned Tay into a hate-speech-spewing chat bot from hell. Talk about corrupted training data.

A machine learning lexicon

But machine learning is really just the tip of the AI iceberg. Other terms closely associated with machine learning are neural networks, deep learning, and cognitive computing.

Neural network. A computer architecture designed to mimic the structure of neurons in our brains, with each artificial neuron (microcircuit) connecting to other neurons inside the system. Neural networks are arranged in layers, with neurons in one layer passing data to multiple neurons in the next layer, and so on, until eventually they reach the output layer. This final layer is where the neural network presents its best guesses as to, say, what that dog-shaped object was, along with a confidence score.
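
The arithmetic behind that layered hand-off is simple: each neuron computes a weighted sum of its inputs and passes the result through an activation function. The following minimal Java sketch runs a single forward pass through one hidden layer; the weights are hardcoded for illustration, whereas a real network would learn them during training:

// Minimal forward pass: 3 inputs -> 2 hidden neurons -> 1 output "confidence"
public class ToyForwardPass {
    static double sigmoid(double z) { return 1.0 / (1.0 + Math.exp(-z)); }

    public static void main(String[] args) {
        double[] input = {0.8, 0.2, 0.5}; // e.g., extracted image features
        double[][] wHidden = {{0.4, -0.6, 0.9}, {-0.3, 0.8, 0.1}};
        double[] bHidden = {0.1, -0.2};
        double[] wOut = {1.2, -0.7};
        double bOut = 0.05;

        // Each hidden neuron: weighted sum of inputs, then activation
        double[] hidden = new double[2];
        for (int j = 0; j < 2; j++) {
            double z = bHidden[j];
            for (int i = 0; i < 3; i++) z += wHidden[j][i] * input[i];
            hidden[j] = sigmoid(z);
        }

        // Output layer produces the network's confidence score
        double z = bOut;
        for (int j = 0; j < 2; j++) z += wOut[j] * hidden[j];
        System.out.printf("Confidence it's a dog: %.2f%n", sigmoid(z));
    }
}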

There are multiple types of neural networks for solving different types of problems. Networks with large numbers of layers are called “deep neural networks.” Neural nets are some of the most important tools used in machine learning scenarios, but not the only ones.

Deep learning. This is essentially machine learning on steroids, using multi-layered (deep) neural networks to arrive at decisions based on “imperfect” or incomplete information. The deep learning system DeepStack defeated 11 professional poker players last December by constantly recomputing its strategy after each round of bets.

Cognitive computing. This is the term favored by IBM, creators of Watson, the supercomputer that kicked humanity’s ass at Jeopardy in 2011. The difference between cognitive computing and artificial intelligence, in IBM’s view, is that instead of replacing human intelligence, cognitive computing is designed to augment it—enabling doctors to diagnose illnesses more accurately, financial managers to make smarter recommendations, lawyers to search caselaw more quickly, and so on.

This, of course, is an extremely superficial overview. Those who want to dive more deeply into the intricacies of AI and machine learning can start with this semi-wonky tutorial from the University of Washington’s Pedro Domingos, or this series of Medium posts from Adam Geitgey, as well as “What deep learning really means” by InfoWorld’s Martin Heller.

Despite all the hype about AI, it’s not an overstatement to say that machine learning and the technologies associated with it are changing the world as we know it. Best to learn about it now, before the machines become fully self-aware.

Source: InfoWorld Big Data

Equinix Collaborates With SAP

Equinix, Inc. has announced that it is offering direct and private access to the SAP® Cloud portfolio, including SAP HANA® Enterprise Cloud and SAP Cloud Platform, in multiple markets across the globe. Dedicated, private connections are available via Equinix Cloud Exchange™ and the SAP Cloud Peering service in the Equinix Amsterdam, Frankfurt, Los Angeles, New York, Silicon Valley, Sydney, Toronto and Washington, D.C. International Business Exchange™ (IBX®) data centers, with additional markets planned for later this year. Through this connectivity, enterprise customers benefit from high-performance and secure access to SAP cloud services as part of a hybrid or multi-cloud strategy.

“Equinix recognizes that enterprise cloud needs vary, and by aligning a company’s business requirements to the best cloud services, they can create a more agile, flexible and scalable IT infrastructure.  With more than 130 million cloud subscribers, SAP has a strong foothold in the enterprise market, and by providing these customers and more with dedicated connectivity to their SAP software environments simply, securely and cost-effectively from Equinix Cloud Exchange, we help customers connect and build a hybrid cloud solution that works for them,” said Charles Meyers, president of strategy, services and innovation, Equinix.

As cloud adoption continues to rise, so does the growth of multi-cloud deployments. In fact, according to the recent IDC CloudView survey, 85% of respondents are either currently using a multi-cloud strategy or plan to do so in the near-term.* Equinix Cloud Exchange, with direct access to multiple cloud services and platforms, such as the SAP Cloud portfolio, helps enterprise customers to expedite the development of hybrid and multi-cloud solutions across multiple locations, with the goal of gaining global scale, performance and security.

SAP Cloud Peering provides direct access inside the Equinix Cloud Exchange to help customers looking to reap the benefits of the SAP Cloud portfolio, with the control and predictability of a dedicated connection. Initially, access will be available for SAP HANA Enterprise Cloud and SAP Cloud Platform, which serve as SAP’s IaaS and PaaS solutions, respectively. SAP and Equinix plan to make SAP SuccessFactors®, SAP Hybris®, SAP Ariba® solutions, and others available in the near future.

“SAP joined the Equinix Cloud Exchange platform to address customer requirements for enterprise hybrid architecture in an environment that lends itself to the very highest levels of performance and reliability. With SAP’s traditional base of more than 300,000 software customers seeking ways to take the next step in a cloud-enabled world, SAP has established efficient capabilities to deliver on those requirements,” said Christoph Boehm, senior vice president and head of Cloud Delivery Services, SAP.

SAP continues to gain traction in enterprise cloud adoption, with particular strength in APAC and EMEA. According to a recent 451 Research** note, SAP’s APAC cloud subscription and support revenue grew by 54%, while it rose by 35% in EMEA and by 27% in the Americas.  Access to these cloud-based services in Equinix’s global footprint of data centers will help drive adoption and reach of SAP cloud offerings.

Equinix offers the industry’s broadest choice in cloud service providers, including AWS, Microsoft Azure, Oracle, Google Cloud Platform, and other leading cloud providers such as SAP.  Equinix offers direct connections to many of these platforms via Equinix Cloud Exchange or Equinix Cross Connects. Equinix Cloud Exchange is an advanced interconnection solution that provides virtualized, private direct connections that bypass the Internet to provide better security and performance with a range of bandwidth options. It is currently available in 23 markets, globally.

 

*Source:  IDC CloudView Survey, April 2017. N=6084 worldwide respondents, weighted by country, industry and company size. 
**Source: 451 Research, “SAP hits Q4 and FY2016 targets as cloud subscription/support revenue jumps 31%,” February 1, 2017

Source: CloudStrategyMag

IBM speeds deep learning by using multiple servers

For everyone frustrated by how long it takes to train deep learning models, IBM has some good news: It has unveiled a way to automatically split deep-learning training jobs across multiple physical servers — not just individual GPUs, but whole systems with their own separate sets of GPUs.

Now the bad news: It’s available only in IBM’s PowerAI 4.0 software package, which runs exclusively on IBM’s own OpenPower hardware systems.

Distributed Deep Learning (DDL) doesn’t require developers to learn an entirely new deep learning framework. It repackages several common frameworks for machine learning: TensorFlow, Torch, Caffe, Chainer, and Theano. Deep learning projects that use those frameworks can then run in parallel across multiple hardware nodes.

IBM claims the speedup gained by scaling across nodes is nearly linear. One benchmark, training the ResNet-101 neural network model on the ImageNet-22K data set, needed 16 days to complete on one IBM S822LC server. Spread across 64 such systems, the same benchmark concluded in seven hours, or 58 times faster.

IBM offers two ways to use DDL. One, you can shell out the cash for the servers it’s designed for, which sport two Nvidia Tesla P100 units each, at about $50,000 a head. Two, you can run the PowerAI software in a cloud instance provided by IBM partner Nimbix, for around $0.43 an hour.

One thing you can’t do, though, is run PowerAI on commodity Intel x86 systems. IBM has no plans to offer PowerAI on that platform, citing tight integration between PowerAI’s proprietary components and the OpenPower systems designed to support them. Most of the magic, IBM says, comes from a machine-to-machine software interconnection system that rides on top of whatever hardware fabric is available. Typically, that’s an InfiniBand link, although IBM claims it can also work on conventional gigabit Ethernet (still, IBM admits it won’t run anywhere near as fast).

It’s been possible to do deep-learning training on multiple systems in a cluster for some time now, although each framework tends to have its own set of solutions. With Caffe, for example, there’s the Parallel ML System or CaffeOnSpark. TensorFlow can also be distributed across multiple servers, but again any integration with other frameworks is something you’ll have to add by hand.

IBM’s claimed advantages are that DDL works with multiple frameworks and requires less heavy lifting to set up. But those advantages come at the cost of running on IBM’s own iron.

Source: InfoWorld Big Data

How to avoid big data analytics failures

Big data and analytics initiatives can be game-changing, giving you insights to help blow past the competition, generate new revenue sources, and better serve customers.

Big data and analytics initiatives can also be colossal failures, resulting in lots of wasted money and time—not to mention the loss of talented technology professionals who become fed up at frustrating management blunders.

How can you avoid big data failures? Some of the best practices are the obvious ones from a basic business management standpoint: be sure to have executive buy-in from the most senior levels of the company, ensure adequate funding for all the technology investments that will be needed, and bring in the needed expertise and/or have good training in place. If you don’t address these basics first, nothing else really matters.

But assuming that you have done the basics, what separates success from failure in big data analytics is how you deal with the technical issues and challenges of big data analytics. Here’s what you can do to stay on the success side of the equation.

Source: InfoWorld Big Data

ZNetLive Rolls Out Managed Microsoft Azure Stack

ZNetLive has announced that it has made Microsoft Azure Stack, Microsoft’s consistent platform for hybrid cloud, available to businesses of all sizes with complete deployment and operational support. This will enable enterprises to seamlessly implement and manage their data in a hybrid cloud environment while realizing benefits like agility, scalability, and flexibility, regardless of their size or the cloud expertise of their IT staff.

Azure Stack provides a secure path for enterprises that want to use the Azure public cloud but are held back by concerns such as data criticality, compliance, and data accessibility. An extension of the Microsoft Azure platform, it lets them run the same capabilities that the Azure public cloud offers within their own data centers, on premises. It gives them control over data that stays within their own boundaries, behind all of their security software. This helps them get the benefits of both the public and private cloud worlds.

Microsoft announced that Azure Stack was ready to order at its recently concluded partner event, Microsoft Inspire. “We have delivered Azure Stack software to our hardware partners, enabling us to begin the certification process for their integrated systems, with the first systems to begin shipping in September,” wrote Mike Neil, corporate vice president, Azure Infrastructure and Management, in his blog post.

“Azure Stack is another step by Microsoft in fulfilling its digital transformation goals. ZNetLive has always been among the forerunners in bringing digital transformation technologies to end customers, and thus we decided to offer dedicated support services for Microsoft Azure Stack while attending Microsoft Inspire in Washington, D.C. With Microsoft, we take our next step in creating a digitally transformed world,” said Munesh Jadoun, founder & CEO, ZNetLive.

ZNetLive’s Microsoft Azure Stack management services will provide benefits including, but not limited to, the following:

One support stop for Azure – ZNetLive will provide completely unified support for Azure cloud and Azure Stack cloud, including platform elements like hardware and VMs. ZNetLive’s Microsoft certified Azure experts will provide round-the-clock assistance in creating, installing, operating, monitoring, and optimizing cloud environments using Azure Stack.

Assurance and trust with expertise in handling Microsoft Cloud services – As the first Microsoft Cloud Solution Provider (CSP) in Rajasthan, India, ZNetLive has been working closely with Microsoft for over a decade, providing Microsoft cloud management services that help end customers digitally transform.

As a long-standing Microsoft partner, ZNetLive has earned many Microsoft designations, including Cloud OS Network Partner, Gold Hosting, Gold Data Center, and Gold Cloud Productivity, to name a few.

Ensuring secure services – With managed Microsoft Azure Stack, customers can select ZNetLive’s state-of-the-art, certified data centers to host their Azure Stack cloud. ZNetLive’s trained technical experts will take care of application and infrastructure security, with regular health-check monitoring and recommendations for security services to help customers meet regulatory compliance requirements.

 “We have been working with Microsoft Azure for a long time now. Before Microsoft Azure Stack, we used System Center 2012 R2 product suite to create private cloud environments for our customers. This led to an increase in costs due to hardware, licensing, maintenance, upgrades etc.

But now with Azure Stack, we’ll be able to provide our customers the same Azure capabilities in their own data centers at much lesser prices. Since the intuitive interface of Azure Stack is same as that of Azure, the team will be at ease creating VMs, cloud databases, and other Azure cloud services with no additional training,” said Bhupender Singh, chief technical officer, ZNetLive. 

Source: CloudStrategyMag

Report: NetOps And DevOps Want More Collaboration In A Multi-Cloud World

F5 Networks has announced the results of a recent survey comparing the views of over 850 NetOps and DevOps IT professionals on their respective disciplines and collaboration practices. Traditionally, the larger IT market has viewed these two groups as somewhat antagonistic toward one another. However, the F5 survey indicates they are largely aligned on priorities, with converging interests around the production pipeline and automation capabilities. Reconciling the survey results with the current trend of DevOps turning to outside solutions (such as shadow IT) to deploy applications, one implication emerges: NetOps will need additional skills to adequately support efforts tied to digital transformation and multi-cloud deployments.

In parallel, F5’s Americas Agility conference takes place this week in Chicago, featuring a dedicated focus toward topics relevant to the interplay between NetOps and DevOps. With hands-on experiences such as technology labs and training classes aimed at helping operations and development professionals take advantage of programmable solutions, the event explores how modern applications are successfully developed, deployed, secured, and supported.

Key Survey Findings

NetOps and DevOps respect each other’s priorities: Within each group, over three-quarters of NetOps and DevOps personnel believe the other function to be prioritizing “the right things” within IT, signaling a common understanding of broader goals, and opportunities to increase collaboration between the teams. In addition, the groups are fairly aligned on the pace at which apps and services are delivered, with frequency of deployments satisfying a significant majority of both DevOps (70%) and NetOps (74%) personnel.

Support for automation: Both segments agreed that automation within the production pipeline is important, with an average rating of significance on a 5-point scale of 4.0 from DevOps and 3.5 from NetOps. Respondents also reported more confidence in the reliability, performance, and security of applications when the production pipeline is more than 50% automated.

  • Dissonance around pipeline access: A difference of opinion surfaced around the ideal level of shared access to production resources. Forty-five percent of DevOps respondents believe they should have access to at least 75% of the production pipeline, while significantly fewer NetOps respondents (31%) placed the access figure for DevOps that high, hinting at a partial disconnect surrounding expectations and best practices within IT. This misalignment can hamper efforts to streamline processes and deliver the applications the business needs to succeed in a digital economy.

Differences driving multi-cloud deployments: The majority of DevOps (65%) admitted to being influenced toward adopting cloud solutions either “a lot” or “some” by the state of access to the pipeline via automation/self-service capabilities. Related, a significant portion of NetOps (44%) indicated that DevOps’ use of outside cloud technologies affects their desire to provide pipeline access “some,” with an additional 21% stating that it influences them “a lot.” One result of this is the use of multiple cloud solutions and providers across IT, further complicating the process of delivering, deploying, and scaling applications that support digital transformation efforts.

“We see some interesting data points around network- and development-focused personnel,” said Ben Gibson, EVP and chief marketing officer, F5. “While DevOps seeks more open access to the deployment pipeline to drive the speed of innovation, NetOps can be much more cautious around permissions — presumably because they’re the ones that bear the responsibility if security, availability, or performance are compromised. Despite different approaches, both groups support each other’s efforts, and seem to agree that more flexible technologies are needed to overcome current business limitations, bridge disparate functions, and position IT to better leverage public, private, and multi-cloud environments. Overall, neither group’s responses seemed particularly well aligned with the ‘us vs. them’ narrative that has loomed large in the media to date.”

Bridging IT Functions between NetOps and DevOps

Taken together, the survey results point to a rising interest in automation and self-service that can be linked to the rapid adoption of cloud-based solutions, and the desired flexibility they provide. NetOps and DevOps each demonstrate a willingness to introduce emerging technologies and methods into the production pipeline. However, the speed of innovation can also push traditional IT operations teams beyond their current skill levels, creating potential resistance on the path to streamlined future application rollouts. In the survey, DevOps respondents reported a confidence level of 3.6 on a 5-point scale in terms of whether they have the skills their job function requires, with NetOps’ self-assessment yielding a slightly lower figure (3.4).

The survey findings are in step with F5’s belief that enhanced education will play a larger role in bringing these two groups together and rallying around shared goals. To that end, F5 offers a growing library of industry certification programs that help customers tailor their application delivery infrastructures across related disciplines and provide common frameworks for different roles throughout the organization. With testing available at F5’s Agility conferences and other venues, over 2,500 certifications have been earned in the past year. In addition, F5’s vibrant DevCentral community provides a means for over 250,000 customers, developers, and other IT professionals to pool their collective knowledge, learn from each other’s experiences, and make the most of their technology investments.

Looking forward, F5 is focused on enabling shared empowerment between NetOps and DevOps teammates, concurrent with their use of multi-cloud solutions. The company’s programmable BIG-IP® products, along with adjacent technologies such as its container-focused offerings, provide compelling platforms for evolving IT groups to apply valuable acceleration, availability, and security services to make their applications, users, and operations practices more successful. Further detail on the survey results and methodology can be found in a companion report.

Source: CloudStrategyMag

LockPath Included In Gartner’s 2017 Magic Quadrant For BCMP Solutions

LockPath has announced it has been included in Gartner Inc.’s Magic Quadrant for Business Continuity Management Program Solutions, Worldwide.

LockPath was one of 12 vendors included in Gartner Inc.’s report, which was published July 12, by Gartner analysts Roberta Witty and Mark Jaggers. The report, which aims to help organizations evaluate business continuity management program (BCMP) software solutions, recognized LockPath as a challenger in the space for its Keylight Platform.

According to Witty and Jaggers, “The 2017 BCMP solutions market — with an estimated $300 million global market revenue — has broadened its IT disaster recovery management, crisis management and risk management capabilities since 2016.”

LockPath’s Keylight Platform supports enterprise BCMP efforts to identify and mitigate operational risks that could potentially lead to disruption. The platform’s integrated and holistic approach to risk management allows organizations to coordinate efforts across the business to continue operations after serious incidents or disasters.

“We are thrilled to be included in this year’s Magic Quadrant,” said Chris Caldwell, CEO, LockPath. “With the number of threats that can adversely impact operations multiplying, our customers are finding value in including business continuity as part of their overall integrated risk management and GRC programs.”

 

Source: CloudStrategyMag

Report: Shifting Clouds, Surging M&A Shape 2017 Data Center Demand

As consumers turn to their smartphones for everything from streaming video to buying their groceries, the data center industry is stepping up to meet escalating demand for storage. A new report from JLL reveals data center construction in North America is up 43% from 2016 and industry consolidation powered a $10 billion surge in mergers and acquisitions (M&A) in the first half of 2017. Meanwhile, cloud leasing activity started shifting to global markets. 

“While M&A activity is surging, leasing has quietly returned to normal in the U.S.,” said Bo Bond, managing director and data center solutions co-lead, JLL. “The acquisition of large amounts of server space in the U.S. by cloud companies continues, but is no longer as frenetic as it was in 2016. Data center users are now turning their attention toward filling out their global data center footprint and making technology investments to keep them ahead in a rapidly changing industry.”

Data center users investing in the future

In exclusive interviews for JLL’s report, top data center users addressed the hot topics in the industry and how they affect their investment decisions. Users revealed the biggest industry changes coming over the next two years:

  • Efficiency programs will install automation to make data center operations more valuable to the core business.
  • Artificial intelligence will help reduce human intervention in data centers and significantly cut time to restore operations in the event of a failure.
  • Artificial intelligence will make greater use of predictive analytics on-site.
  • Processor technology investments will improve cooling and reduce energy usage.

“Data center users are investing in systems that will allow them to use their servers more efficiently and effectively,” said Mark Bauer, managing director and data center solutions market director, JLL. “Essential technological advancements like artificial intelligence to anticipate failures and automation to reduce response time are what the industry needs to keep up with today’s digital consumer.”

Surprising local market impacts

While data center users are looking to expand their global footprint, North America remains an important location for data storage. In fact, revenue and growth are up for data center companies in a big way in North America. The following markets experienced significant shifts in the first half of 2017:

  • Northern Virginia: Supply is growing at a historic rate, driven by its top-tier status in the data center industry. But with a shortage of available big-block spaces, providers are scrambling to bring new inventory online as quickly as possible to capitalize on the market’s low vacancy and pent-up user demand.
  • Dallas/Fort Worth: The first half of 2017 brought changes to the market, with cloud providers officially setting up shop, spurring a 50% bump in absorption. Low power costs will continue to be a major advantage for the market.
  • Northern California: Leasing activity regressed to traditional market levels in the first half of 2017 after large providers drove absorption in the region throughout 2016. Moving forward, construction and occupancy costs will continue to decrease as large blocks of space open up for users.
  • Atlanta: Driven by continued success of both tenured operators and newer operators hitting their stride, the market sustained its strong growth from 2016 during the first half of 2017. Providers and users are now evaluating ways to enter the historically underserved market as they look to anchor their presence in the Southeast.
  • Montréal: Following the raging storm of U.S. cloud activity in 2016, big-name cloud providers swooped in to Montréal in the first half of 2017. The timing is right for providers to enter the Canadian market and take advantage of its optimal pricing and low power rates.

For more insights on data center industry performance in the first half of 2017, with research from data center hub markets across the U.S. and Canada, download the report.

Source: CloudStrategyMag

How to use Redis for real-time stream processing

Real-time streaming data ingest is a common requirement for many big data use cases. In fields like IoT, e-commerce, security, communications, entertainment, finance, and retail, where so much depends on timely and accurate data-driven decision making, real-time data collection and analysis are in fact core to the business.

However, collecting, storing and processing streaming data in large volumes and at high velocity presents architectural challenges. An important first step in delivering real-time data analysis is ensuring that adequate network, compute, storage, and memory resources are available to capture fast data streams. But a company’s software stack must match the performance of its physical infrastructure. Otherwise, businesses will face a massive backlog of data, or worse, missing or incomplete data.

Redis has become a popular choice for such fast data ingest scenarios. A lightweight in-memory database platform, Redis achieves throughput in the millions of operations per second with sub-millisecond latencies, while drawing on minimal resources. It also offers simple implementations, enabled by its multiple data structures and functions.

In this article, I will show how Redis Enterprise can solve common challenges associated with the ingestion and processing of large volumes of high velocity data. We’ll walk through three different approaches (including code) to processing a Twitter feed in real time, using Redis Pub/Sub, Redis Lists, and Redis Sorted Sets, respectively. As we’ll see, all three methods have a role to play in fast data ingestion, depending on the use case.

Challenges in designing fast data ingest solutions

High-speed data ingestion often involves several different types of complexity:

  • Large volumes of data sometimes arriving in bursts. Bursty data requires a solution that is capable of processing large volumes of data with minimal latency. Ideally, it should be able to perform millions of writes per second with sub-millisecond latency, using minimal resources.
  • Data from multiple sources. Data ingest solutions must be flexible enough to handle data in many different formats, retaining source identity if needed and transforming or normalizing in real-time.
  • Data that needs to be filtered, analyzed, or forwarded. Most data ingest solutions have one or more subscribers who consume the data. These are often different applications that function in the same or different locations with a varied set of assumptions. In such cases, the database not only needs to transform the data, but also filter or aggregate depending on the requirements of the consuming applications.
  • Data coming from geographically distributed sources. In this scenario, it is often convenient to distribute the data collection nodes, placing them close to the sources. The nodes themselves become part of the fast data ingest solution, to collect, process, forward, or reroute ingest data.

Handling fast data ingest in Redis

Many solutions supporting fast data ingest today are complex, feature-rich, and over-engineered for simple requirements. Redis, on the other hand, is extremely lightweight, fast, and easy to use. With clients available in more than 60 languages, Redis can be easily integrated with the popular software stacks.

Redis offers data structures such as Lists, Sets, Sorted Sets, and Hashes that offer simple and versatile data processing. Redis delivers more than a million read/write operations per second, with sub-millisecond latency on a modestly sized commodity cloud instance, making it extremely resource-efficient for large volumes of data. Redis also supports messaging services and client libraries in all of the popular programming languages, making it well-suited for combining high-speed data ingest and real-time analytics. Redis Pub/Sub commands allow it to play the role of a message broker between publishers and subscribers, a feature often used to send notifications or messages between distributed data ingest nodes.

Redis Enterprise enhances Redis with seamless scaling, always-on availability, automated deployment, and the ability to use cost-effective flash memory as a RAM extender so that the processing of large datasets can be accomplished cost-effectively.

In the sections below, I will outline how to use Redis Enterprise to address common data ingest challenges.

Redis at the speed of Twitter

To illustrate the simplicity of Redis, we’ll explore a sample fast data ingest solution that gathers messages from a Twitter feed. The goal of this solution is to process tweets in real-time and push them down the pipe as they are processed.

Twitter data ingested by the solution is then consumed by multiple processors down the line. As shown in Figure 1, this example deals with two processors – the English Tweet Processor and the Influencer Processor. Each processor filters the tweets and passes them down its respective channels to other consumers. This chain can go as far as the solution requires. However, in our example, we stop at the third level, where we aggregate popular discussions among English speakers and top influencers.

Figure 1. Flow of the Twitter stream (image: Redis Labs)

Note that we are using the example of processing Twitter feeds because of the velocity of data arrival and simplicity. Note also that Twitter data reaches our fast data ingest via a single channel. In many cases, such as IoT, there could be multiple data sources sending data to the main receiver.

There are three possible ways to implement this solution using Redis: ingest with Redis Pub/Sub, ingest with the List data structure, or ingest with the Sorted Set data structure. Let’s examine each of these options.

Ingest with Redis Pub/Sub

This is the simplest implementation of fast data ingest. This solution uses Redis’s Pub/Sub feature, which allows applications to publish and subscribe to messages. As shown in Figure 2, each stage processes the data and publishes it to a channel. The subsequent stage subscribes to the channel and receives the messages for further processing or filtering.

Figure 2. Data ingest using Redis Pub/Sub (image: Redis Labs)

Pros

  • Easy to implement.
  • Works well when the data sources and processors are distributed geographically.

Cons 

  • The solution requires the publishers and subscribers to be up all the time. Subscribers lose data when stopped, or when the connection is lost.
  • It requires more connections. A program cannot publish and subscribe to the same connection, so each intermediate data processor requires two connections – one to subscribe and one to publish. If running Redis on a DBaaS platform, it is important to verify whether your package or level of service has any limits to the number of connections.

A note about connections

If more than one client subscribes to a channel, Redis pushes the data to each client linearly, one after the other. Large data payloads and many connections may introduce latency between a publisher and its subscribers. Although the default hard limit for maximum number of connections is 10,000, you must test and benchmark how many connections are appropriate for your payload.

Redis maintains a client output buffer for each client. The default limits for the client output buffer for Pub/Sub are set as:

client-output-buffer-limit pubsub 32mb 8mb 60

With this setting, Redis will force clients to disconnect under two conditions: if the output buffer grows beyond 32MB, or if the output buffer holds 8MB of data consistently for 60 seconds.

These are indications that clients are consuming the data more slowly than it is published. Should such a situation arise, first try optimizing the consumers such that they do not add latency while consuming the data. If you notice that your clients are still getting disconnected, then you may increase the limits for the client-output-buffer-limit pubsub property in redis.conf. Please keep in mind that any changes to the settings may increase latency between the publisher and subscriber. Any changes must be tested and verified thoroughly.
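
For example, assuming you have access to redis.conf or to the CONFIG SET command, raising the limits might look like the lines below; the values shown are purely illustrative, not recommendations, and should be tested under your own load:

# In redis.conf:
client-output-buffer-limit pubsub 64mb 16mb 120

# Or at runtime via redis-cli, without a restart:
CONFIG SET client-output-buffer-limit "pubsub 64mb 16mb 120"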

Code design for the Redis Pub/Sub solution

Figure 3. Class diagram of the fast data ingest solution with Redis Pub/Sub (image: Redis Labs)

This is the simplest of the three solutions described in this article. Here are the important Java classes implemented for this solution. Download the source code with full implementation here: https://github.com/redislabsdemo/IngestPubSub.

The Subscriber class is the core class of this design. Every Subscriber object maintains a new connection with Redis.

class Subscriber extends JedisPubSub implements Runnable {
    private String name = "Subscriber";
    private RedisConnection conn = null;
    private Jedis jedis = null;

    private String subscriberChannel = "defaultchannel";

    public Subscriber(String subscriberName, String channelName) throws Exception {
        name = subscriberName;
        subscriberChannel = channelName;
        Thread t = new Thread(this);
        t.start();
    }

    @Override
    public void run() {
        try {
            conn = RedisConnection.getRedisConnection();
            jedis = conn.getJedis();
            while (true) {
                // subscribe() blocks; the loop re-subscribes if the call ever returns
                jedis.subscribe(this, this.subscriberChannel);
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    @Override
    public void onMessage(String channel, String message) {
        super.onMessage(channel, message);
    }
}

The Publisher class maintains a separate connection to Redis for publishing messages to a channel.

public class Publisher {

    RedisConnection conn = null;
    Jedis jedis = null;

    private String channel = "defaultchannel";

    public Publisher(String channelName) throws Exception {
        channel = channelName;
        conn = RedisConnection.getRedisConnection();
        jedis = conn.getJedis();
    }

    public void publish(String msg) throws Exception {
        jedis.publish(channel, msg);
    }
}

The EnglishTweetFilter, InfluencerTweetFilter, HashTagCollector, and InfluencerCollector filters extend Subscriber, which enables them to listen to the inbound channels. Since you need separate Redis connections for subscribe and publish, each filter class has its own RedisConnection object. Filters listen to the new messages in their channels in a loop. Here is the sample code of the EnglishTweetFilter class:

public class EnglishTweetFilter extends Subscriber {

    private RedisConnection conn = null;
    private Jedis jedis = null;
    private String publisherChannel = null;

    public EnglishTweetFilter(String name, String subscriberChannel,
            String publisherChannel) throws Exception {
        super(name, subscriberChannel);
        this.publisherChannel = publisherChannel;
        conn = RedisConnection.getRedisConnection();
        jedis = conn.getJedis();
    }

    @Override
    public void onMessage(String subscriberChannel, String message) {
        JsonParser jsonParser = new JsonParser();
        JsonElement jsonElement = jsonParser.parse(message);
        JsonObject jsonObject = jsonElement.getAsJsonObject();

        // filter messages: publish only English tweets
        if (jsonObject.get("lang") != null &&
                jsonObject.get("lang").getAsString().equals("en")) {
            jedis.publish(publisherChannel, message);
        }
    }
}

The Publisher class has a publish method that publishes messages to the required channel.

public class Publisher {
    .
    .
    public void publish(String msg) throws Exception {
        jedis.publish(channel, msg);
    }
    .
}

The main class reads data from the ingest stream and posts it to the AllData channel. The main method of this class starts all of the filter objects.

public class IngestPubSub {
    .
    public void start() throws Exception {
        .
        .
        publisher = new Publisher("AllData");

        englishFilter = new EnglishTweetFilter("English Filter",
                "AllData", "EnglishTweets");
        influencerFilter = new InfluencerTweetFilter("Influencer Filter",
                "AllData", "InfluencerTweets");
        hashtagCollector = new HashTagCollector("Hashtag Collector",
                "EnglishTweets");
        influencerCollector = new InfluencerCollector("Influencer Collector",
                "InfluencerTweets");
        .
        .
    }
}

Ingest with Redis Lists

The List data structure in Redis makes implementing a queueing solution easy and straightforward. In this solution, the producer pushes every message to the back of the queue, and the subscriber polls the queue and pulls new messages from the other end.
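
At the Redis command level, the queue amounts to just two operations, which you can try in redis-cli (the key name here is illustrative):

LPUSH alldata "{...tweet json...}"    # producer pushes to one end of the list
BRPOP alldata 0                       # consumer blocks until a message arrives
                                      # (timeout 0 = wait indefinitely)

The Java classes below wrap exactly this pair of commands.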

Figure 4. Fast data ingest with Redis Lists (image: Redis Labs)

Pros

  • This method is reliable in cases of connection loss. Once data is pushed into the lists, it is preserved there until the subscribers read it. This is true even if the subscribers are stopped or lose their connection with the Redis server.
  • Producers and consumers require no connection between them.

Cons

  • Once data is pulled from the list, it is removed and cannot be retrieved again. Unless the consumers persist the data, it is lost as soon as it is consumed.
  • Every consumer requires a separate queue, which requires storing multiple copies of the data.

Code design for the Redis Lists solution

Figure 5. Class diagram of the fast data ingest solution with Redis Lists (image: Redis Labs)

You can download the source code for the Redis Lists solution here: https://github.com/redislabsdemo/IngestList. This solution’s main classes are explained below.

MessageList embeds the Redis List data structure. The push() method pushes the new message to the left of the queue, and pop() waits for a new message from the right if the queue is empty.

public class MessageList {

    protected String name = "MyList"; // Name
    .
    .
    public void push(String msg) throws Exception {
        jedis.lpush(name, msg); // Left Push
    }

    public String pop() throws Exception {
        // BRPOP blocks until a message is available and returns a
        // [listName, value] pair, so return the value element
        return jedis.brpop(0, name).get(1);
    }
    .
    .
}

MessageListener is an abstract class that implements listener and publisher logic. A MessageListener object listens to only one list, but can publish to multiple channels (MessageFilter objects). This solution requires a separate MessageFilter object for each subscriber down the pipe.

class MessageListener implements Runnable {
    private String name = null;
    private MessageList inboundList = null;
    Map<String, MessageFilter> outBoundMsgFilters = new HashMap<String, MessageFilter>();
    .
    .
    public void registerOutBoundMessageList(MessageFilter msgFilter) {
        if (msgFilter != null) {
            if (outBoundMsgFilters.get(msgFilter.name) == null) {
                outBoundMsgFilters.put(msgFilter.name, msgFilter);
            }
        }
    }

    .
    .
    @Override
    public void run() {
        .
        while (true) {
            String msg = inboundList.pop();
            processMessage(msg);
        }
        .
    }

    .
    protected void pushMessage(String msg) throws Exception {
        Set<String> outBoundMsgNames = outBoundMsgFilters.keySet();
        for (String name : outBoundMsgNames) {
            MessageFilter msgList = outBoundMsgFilters.get(name);
            msgList.filterAndPush(msg);
        }
    }
}

MessageFilter is a parent class facilitating the filterAndPush() method. As data flows through the ingest system, it is often filtered or transformed before being sent to the next stage. Classes that extend the MessageFilter class override the filterAndPush() method, and implement their own logic to push the filtered message to the next list.

public class MessageFilter {

    MessageList messageList = null;
    .
    .
    public void filterAndPush(String msg) throws Exception {
        messageList.push(msg);
    }
    .
    .
}

AllTweetsListener is a sample implementation of a MessageListener class. This listens to all tweets on the AllData channel, and publishes the data to EnglishTweetsFilter and InfluencerFilter.

public class AllTweetsListener extends MessageListener {
    .
    .
    public static void main(String[] args) throws Exception {
        MessageListener allTweetsProcessor = AllTweetsListener.getInstance();

        allTweetsProcessor.registerOutBoundMessageList(
                new EnglishTweetsFilter("EnglishTweetsFilter", "EnglishTweets"));
        allTweetsProcessor.registerOutBoundMessageList(
                new InfluencerFilter("InfluencerFilter", "Influencers"));

        allTweetsProcessor.start();
    }
    .
    .
}

EnglishTweetsFilter extends MessageFilter. This class implements logic to select only those tweets that are marked as English tweets. The filter discards non-English tweets and pushes English tweets to the next list.

public class EnglishTweetsFilter extends MessageFilter {

    public EnglishTweetsFilter(String name, String listName) throws Exception {
        super(name, listName);
    }

    @Override
    public void filterAndPush(String message) throws Exception {
        JsonParser jsonParser = new JsonParser();

        // The message popped from the list is the raw tweet JSON object
        JsonElement jsonElement = jsonParser.parse(message);
        JsonObject jsonObject = jsonElement.getAsJsonObject();
        if (jsonObject.get("lang") != null &&
                jsonObject.get("lang").getAsString().equals("en")) {
            Jedis jedis = super.getJedisInstance();
            if (jedis != null) {
                jedis.lpush(super.name, jsonObject.toString());
            }
        }
    }
}

Ingest using Redis Sorted Sets

One of the concerns with the Pub/Sub method is that it is susceptible to connection loss and hence unreliable. The challenge with the Redis Lists solution is the problem of data duplication and tight coupling between producers and consumers.

The Redis Sorted Sets solution addresses both of these issues. A counter tracks the number of messages, and the messages are indexed against this message count. They are stored in a non-ephemeral state inside the Sorted Sets data structure, which is polled by consumer applications. The consumers check for new data and pull messages by running the ZRANGEBYSCORE command.
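
In raw Redis commands, the pattern looks roughly like this (key names illustrative): the producer increments a counter and uses the new value as the score, and each consumer asks only for scores above the last one it processed:

INCR alldata:count                     # returns, say, 42; used as the score
ZADD alldata 42 "{...tweet json...}"   # index the message against the count
ZRANGEBYSCORE alldata (41 42           # fetch messages newer than score 41

Because consumed entries stay in the sorted set, a late-joining consumer can replay history with the same query.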

Figure 6. Fast data ingest with Redis Sorted Sets and Pub/Sub (image: Redis Labs)

Unlike the previous two solutions, this one allows subscribers to retrieve historical data when needed, and to consume it more than once. Only one copy of the data is stored at each stage, making it ideal for situations where the consumer to producer ratio is very high. However, this approach is more complex and less cost-effective than the previous two solutions.

Pros

  • It can fetch historical data when needed, because retrieved data is not removed from the Sorted Set.
  • The solution is resilient to data connection losses, because producers and consumers require no connection between them.
  • Only one copy of data is stored at each stage, making it ideal for situations where the consumer to producer ratio is very high.

Cons

  • Implementing the solution is more complex.
  • More storage space is required, as data is not deleted from the database when consumed. 

Code design for the Redis Sorted Sets solution

Figure 7. Class diagram of the fast data ingest solution with Redis Sorted Sets (image: Redis Labs)

You can download the source code here: https://github.com/redislabsdemo/IngestSortedSet. The main classes are explained below.

SortedSetPublisher inserts a message into a Sorted Set and increments the counter that tracks new messages. In many practical cases the counter can be replaced by the timestamp.

public class SortedSetPublisher {

    public static String SORTEDSET_COUNT_SUFFIX = "count";

    // Redis connection
    RedisConnection conn = null;

    // Jedis object
    Jedis jedis = null;

    // name of the Sorted Set data structure
    private String sortedSetName = null;

    /*
     * @param name: name of the Sorted Set
     */
    public SortedSetPublisher(String name) throws Exception {
        sortedSetName = name;
        conn = RedisConnection.getRedisConnection();
        jedis = conn.getJedis();
    }

    /*
     * Increment the message counter and index the message against it
     */
    public void publish(String message) throws Exception {
        // Get count
        long count = jedis.incr(sortedSetName + ":" + SORTEDSET_COUNT_SUFFIX);

        // Insert into sorted set
        jedis.zadd(sortedSetName, (double) count, message);
    }
}

The SortedSetFilter class is a parent class that implements logic to learn about new messages, pull them from the database, filter them, and push them to the next level. Classes that implement custom filters extend this class and override the processMessage() method with a custom implementation.

public class SortedSetFilter extends Thread {
    // RedisConnection to query the database
    protected RedisConnection conn = null;

    protected Jedis jedis = null;

    protected String name = "SortedSetSubscriber"; // default name

    protected String subscriberChannel = "defaultchannel"; // default name

    // Name of the Sorted Set
    protected String sortedSetName = null;

    // Channel (sorted set) to publish to
    protected String publisherChannel = null;

    // The key of the last message processed
    protected String lastMsgKey = null;

    // The key of the latest message count
    protected String currentMsgKey = null;

    // Count of the last message processed
    protected volatile String lastMsgCount = null;

    // Time-series publisher for the next level
    protected SortedSetPublisher sortedSetPublisher = null;

    public static String LAST_MESSAGE_COUNT_SUFFIX = "lastmessage";

    /*
     * @param name: name of the SortedSetFilter object
     * @param subscriberChannel: name of the channel to listen to for the
     * availability of new messages
     * @param publisherChannel: name of the channel to publish the availability of
     * new messages
     */
    public SortedSetFilter(String name, String subscriberChannel,
            String publisherChannel) throws Exception {
        this.name = name;
        this.subscriberChannel = subscriberChannel;
        this.sortedSetName = subscriberChannel;
        this.publisherChannel = publisherChannel;
        this.lastMsgKey = name + ":" + LAST_MESSAGE_COUNT_SUFFIX;
        this.currentMsgKey = subscriberChannel + ":"
                + SortedSetPublisher.SORTEDSET_COUNT_SUFFIX;
    }

    @Override
    public void run() {
        try {
            // Connection for reading/writing to sorted sets
            conn = RedisConnection.getRedisConnection();
            jedis = conn.getJedis();
            if (publisherChannel != null) {
                sortedSetPublisher = new SortedSetPublisher(publisherChannel);
            }

            // Load delta data since the last run, then keep polling
            while (true) {
                fetchData();
            }
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /*
     * fetchData() loads the count of the last message processed, then loads
     * all messages that have arrived since that count.
     */
    private void fetchData() throws Exception {
        if (lastMsgCount == null) {
            lastMsgCount = jedis.get(lastMsgKey);
            if (lastMsgCount == null) {
                lastMsgCount = "0";
            }
        }

        String currentCount = jedis.get(currentMsgKey);

        if (currentCount != null && Long.parseLong(currentCount) >
                Long.parseLong(lastMsgCount)) {
            loadSortedSet(lastMsgCount, currentCount);
        } else {
            Thread.sleep(1000); // sleep for a second if there's no data to fetch
        }
    }

    // Load the data from the last count (exclusive) to the current count
    private void loadSortedSet(String lastMsgCount, String currentCount)
            throws Exception {
        // Read from the Sorted Set; "(" makes the lower bound exclusive so the
        // last processed message is not fetched twice
        Set<Tuple> countTuples = jedis.zrangeByScoreWithScores(sortedSetName,
                "(" + lastMsgCount, currentCount);
        for (Tuple t : countTuples) {
            processMessageTuple(t);
        }
    }

    // Unpack each tuple, process it, and persist the new high-water mark
    private void processMessageTuple(Tuple t) throws Exception {
        long score = (long) t.getScore();
        String message = t.getElement();
        lastMsgCount = Long.toString(score);
        processMessage(message);

        jedis.set(lastMsgKey, lastMsgCount);
    }

    // Override this method to customize the filters
    protected void processMessage(String message) throws Exception {
        // Override this method
    }
}

EnglishTweetsFilter is a custom filter that extends SortedSetFilter with its own custom filter to select only tweets that are marked as English.

public class EnglishTweetsFilter extends SortedSetFilter {
    /*
     * @param name: name of the SortedSetFilter object
     * @param subscriberChannel: name of the channel to listen to for the
     * availability of new messages
     * @param publisherChannel: name of the channel to publish the availability
     * of new messages
     */
    public EnglishTweetsFilter(String name, String subscriberChannel,
            String publisherChannel) throws Exception {
        super(name, subscriberChannel, publisherChannel);
    }

    @Override
    protected void processMessage(String message) throws Exception {
        // Filter English tweets, print them, and publish them to the next level
        JsonParser jsonParser = new JsonParser();

        JsonElement jsonElement = jsonParser.parse(message);
        JsonObject jsonObject = jsonElement.getAsJsonObject();

        if (jsonObject.get("lang") != null &&
                jsonObject.get("lang").getAsString().equals("en")) {
            System.out.println(jsonObject.get("text").getAsString());
            if (sortedSetPublisher != null) {
                sortedSetPublisher.publish(jsonObject.toString());
            }
        }
    }

    /*
     * Main method to start EnglishTweetsFilter
     */
    public static void main(String[] args) throws Exception {
        EnglishTweetsFilter englishFilter = new EnglishTweetsFilter(
                "EnglishFilter", "alldata", "englishtweets");
        englishFilter.start();
    }
}

Final thoughts

When using Redis for fast data ingest, its data structures and pub/sub functionality offer a number of options for implementation. Each approach has its advantages and disadvantages. Redis Pub/Sub is easy to implement, and producers and consumers are decoupled. But Pub/Sub is not resilient to connection loss, and it requires many connections. It’s typically used for e-commerce workflows, job and queue management, social media communications, gaming, and log collection.

The Redis Lists method is also easy to implement, and unlike with Pub/Sub, data is not lost when the subscriber loses the connection. Disadvantages include tight coupling of producers and consumers and the duplication of data for each consumer, which makes it unsuitable for some scenarios. Suitable use cases would include financial transactions, gaming, social media, IoT, and fraud detection.

The Redis Sorted Sets method has a larger footprint and is more complex to implement and maintain than the Pub/Sub and List methods, but it overcomes their limitations. It is resilient to connection loss, and because retrieved data is not removed from the Sorted Set, it allows for time-series queries. And because only one copy of the data is stored at each stage, it is very efficient in cases where one producer has many consumers. The Sorted Sets method is a good match for IoT transactions, financial transactions, and metering.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data

Report: Federal IT Pulling Back From Pure-Play Public Cloud Infrastructures

Cloud-savvy Federal IT decision makers are opting for hybrid cloud models over pure-play public cloud infrastructures as they seek to modernize and secure government systems, according to an independent survey underwritten by Nutanix. The survey, conducted by Market Connections, Inc., yielded several key findings. Cost savings using a public-only approach, while possible, have not lived up to the initial hype of cloud computing: while 39% of public cloud users indicated that cost savings are ‘great’, the majority of respondents (61%) noted minimal results, ranging from ‘some savings’ to ‘no savings’ at all. Respondents also noted that not every workload is optimal to run in a public cloud, with financials (43%), custom or mission-specific applications (36%), and human resources/ERP applications (34%) considered the least suited for the public cloud.

The most surprising result was that, as a group, more experienced public cloud users forecast increasing the proportion of application workloads they run in their private clouds over the next two years, indicating that experienced cloud users are increasingly leveraging hybrid models to optimize their environments.

“Federal agencies are realizing that a wholesale move to the public cloud is not always the best approach to meet their desired outcomes,” said Chris Howard, Vice President of Federal, Nutanix. “There is a clear opportunity to achieve the benefits of cloud with a hybrid approach, keeping predictable application workloads on-prem and using public cloud for dynamic applications that require extra capacity for finite periods of time.”

The survey of 150 defense, civilian, and intelligence agency IT decision makers sought to determine whether the move to cloud computing has fulfilled agency expectations since the Cloud First Mandate was issued in 2010. Key areas of focus for the study were cost savings, security and applicability of cloud for all application workloads.

The blind online survey comprised respondents from the Department of Defense, military services, and intelligence agencies (45%), and from federal civilian and independent government agencies, including legislative and judicial bodies (55%). All respondents were familiar with their agency’s cloud usage.

To access the full report and survey results, please visit http://www.nutanix.com/FedStudy.

Source: CloudStrategyMag