R tutorial: Learn to crunch big data with R

A few years ago, I was the CTO and cofounder of a startup in the medical practice management software space. One of the problems we were trying to solve was how medical office visit schedules can optimize everyone’s time. Too often, office visits are scheduled to optimize the physician’s time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out of their lungs.

One of my cofounders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data. After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.

One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Microsoft Excel and other GUI analysis programs, is completely auditable.

Alas, that startup ran out of money not long after I implemented a proof-of-concept web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.

Source: InfoWorld Big Data

IDG Contributor Network: Your analytics strategy is obsolete

In the information age, it’s the data-driven bird that gets the worm. Giant companies like Google, Facebook, and Apple hoard data, because it’s the information equivalent of gold.

But merely hoarding data isn’t enough. You need to be adept at sifting through, tying together, and making sense of all the data spilling out of your data lakes. Only then can you act on data to make better decisions and build smarter products.

Yet in the crowded and overfunded analytics market, seeing through the stupefying vendor smog can be all but impossible. To help you make sense of the vast and confusing analytics space, I’ve put together a list of my top predictions for the next five years.

With any luck, these predictions will help you steer your organization toward data-driven bliss.

1. BI migrates into apps

For the past 20 years, we’ve been witnessing a revolution. Not the kind that happens overnight, but the kind that happens gradually. So slowly, in fact, you may not have noticed.

BI is dying. Or more precisely, BI is transmogrifying.

Tableau, founded in 2003, was the last “BI” company to sprout a unicorn horn. And let’s be honest, Tableau is not really a bread-and-butter BI solution—it’s a data visualization tool that acquired enough BI sparkle to take on the paleolithic Goliaths that formerly dominated the industry.

Every year, users are gorging themselves on more and more analytics through the apps they use, like HubSpot, Salesforce, and MailChimp. Analytics is migrating into the very fabric of business applications.

In essence, business applications are acquiring their own analytics interfaces, custom-tailored to their data and their use cases. This integration and customization makes the analytic interfaces more accessible to users than esoteric, complex general-purpose BI (though at the cost of increasing data silos and making the big picture harder to see).

This trend will continue as B2B apps everywhere start competing on data intelligence offerings (those chintzy one-page analytics dashboards are a relic of the past).

2. Compilers over engines

Historically, fresh and tasty analytics were served up two ways: by precomputation (when common aggregations are precomputed and stored in-memory, like in OLAP engines), or by analytics engines (including analytic databases like Teradata and Vertica).

Analytics engines, like Spark and the data engine in Tableau, are responsible for performing the computation required to answer key questions over an organization’s data.

Now there’s a new player on the scene: the analytics compiler. Analytics compilers can flexibly deploy computations to different infrastructures. Examples include the red-hot TensorFlow, which can deploy computations to GPUs or CPUs, as well as Drill and Quasar Analytics.

Compilers are vastly more flexible than engines because they can take number-crunching recipes and translate them to run in different infrastructures (in-database, on Spark, in a GPU, whatever!). Compilers can also, in theory, generate workflows that run way faster than any interpreted engine.
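To make the idea concrete, here is a minimal Python sketch (using today’s TensorFlow 2 API purely as an illustration; none of this comes from the vendors above) in which a single definition of a computation is traced into a graph and then placed on whichever hardware is available:

    import tensorflow as tf

    @tf.function  # traces the Python function into a portable computation graph
    def sum_of_squares(x):
        return tf.reduce_sum(x * x)

    data = tf.random.uniform([1_000_000])

    with tf.device("/CPU:0"):  # run the compiled graph on the CPU
        cpu_result = sum_of_squares(data)

    # The identical graph can be retargeted at a GPU, if one is present.
    if tf.config.list_physical_devices("GPU"):
        with tf.device("/GPU:0"):
            gpu_result = sum_of_squares(data)

The point is not the arithmetic but the separation of the number-crunching recipe from the hardware it ultimately runs on.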

Even Spark has been acquiring basic compilation facilities, which is a sure sign that compilers are here to stay, and may eventually eclipse legacy pure computational engines.

3. ETL diversifies

Few data acronyms can strike more fear into the hearts of executives than the dreaded “ETL.” Extract-transform-load is the necessary evil by which piles of incomplete, duplicated, unrelated, messy slop are pulled out, cleaned up, and shoved into somewhere the data Vulcans can mind-meld with them.

ETL is the antithesis of modern, agile, and data-driven. ETL means endlessly replicated data, countless delays, and towering expenses. It means not being able to answer the questions that matter when they matter.

In an attempt to make ETL more agile, the industry has developed a variety of alternatives, most heavily funded at the moment by venture capital. These solutions range from high-level ETL tools that make it easier to do ETL into Hadoop or a data warehouse, to streaming ETL solutions, to ETL solutions that leverage machine learning to cross-reference and deduplicate.

Another very interesting class of technology includes tools like Dremio and Xcalar, which reimagine ETL as extract-load-transform (or ELT). In essence, they push transformation to the end and make it lazy, so you don’t have to do any upfront extraction, loading, or transformation.

Historically, ELT has been slow, but these next-generation solutions make ELT fast by dynamically reshaping, indexing, and caching common transformations. This gives you the performance of traditional ETL, with the flexibility of late-stage transformations.

No matter how you slice it, ETL is undergoing dramatic evolution that will make it easier than ever for organizations to rapidly leverage data without time-consuming and costly upfront investments in IT.

4. Data silos open up

The big problems at big organizations don’t really involve fancy analytics. Most companies can’t even add up and count their data. Not because sums and counts are hard, but because data in a modern organization is fragmented and scattered in ten thousand silos.

Thanks to the cloud (including the API revolution and managed data solutions) and recent advances in ETL, it’s becoming easier than ever for organizations to access more of their data in a structured way.

Next-generation data management solutions will play an important role in leveraging these technological advances to make all of an organization’s data analytically accessible to all the right people in a timely fashion.

5. Machine learning gets practical

Machine learning is just past the peak of the hype cycle. Or at least we can hope so. Unnamed tech celebrities who don’t understand how machine learning works continue to rant about doomsday Terminator scenarios, even while consumers can’t stop joking about how terrible Siri is.

Machine learning suffers from a fatal combination of imperfection and inculpability. When machine learning goes wrong (as it often and inevitably does), there’s no one to blame, and no one to learn from the mistake.

That’s an absolute no-no for any kind of mission-critical analytics.

So until we are able to train artificial minds on the entirety of knowledge absorbed by society’s brightest, the magical oracle that can answer any question over the data of a business is very far off. Much farther than five years.

Until then, we are likely to see very focused applications of machine learning. For example, ThoughtSpot’s natural language interface to BI; black-box predictive analytics for structured data sets; and human-assistive technology that lets people see connections between different data sources, correct common errors, and spot anomalies.

These aren’t the superbrains promised in science fiction, but they will make it easier for users to figure out what questions to ask and help guide them toward finding correct answers.

While analytics is a giant market and filled with confusing marketing speak, there are big trends shaping the industry that will dictate where organizations invest.

These trends include the ongoing migration of data intelligence into business applications, the advent of analytic compilers that can deploy workflows to ad hoc infrastructure, the rapidly evolving state of ETL, the increased accessibility of data silos to organizations, and the very pragmatic if unsensational ways that machine learning is improving analytics tools.

These overarching trends for the next five years will ripple into the tools that organizations adopt, the analytic startups that get funded, the acquisitions that existing players make, and the innovation that we see throughout the entire analytic stack, from data warehouse to visual analytics front-ends.

When figuring out what your data architecture and technology stack should look like, choose wisely, because the industry is in the process of reinvention, and few stones will be left unturned.

Source: InfoWorld Big Data

IDG Contributor Network: ETL is dead

Extract, transform, and load. It doesn’t sound too complicated. But, as anyone who’s managed a data pipeline will tell you, the simple name hides a ton of complexity.

And while none of the steps are easy, the part that gives data engineers nightmares is the transform. Taking raw data, cleaning it, filtering it, reshaping it, summarizing it, and rolling it up so that it’s ready for analysis. That’s where most of your time and energy goes, and it’s where there’s the most room for mistakes.

If ETL is so hard, why do we do it this way?

The answer, in short, is because there was no other option. Data warehouses couldn’t handle the raw data as it was extracted from source systems, in all its complexity and size. So the transform step was necessary before you could load and eventually query data. The cost, however, was steep.

Rather than maintaining raw data that could be transformed into any possible end product, the transform shaped your data into an intermediate form that was less flexible. You lost some of the data’s resolution, imposed the current version of your business’ metrics on the data, and threw out useless data.

And if any of that changed—if you needed hourly data when previously you’d only processed daily data, if your metric definitions changed, or some of that “useless” data turned out to not be so useless after all—then you’d have to fix your transformation logic, reprocess your data, and reload it.

The fix might take days or weeks

It wasn’t a great system, but it’s what we had.

So as technologies change and prior constraints fall away, it’s worth asking what we would do in an ideal world—one where data warehouses were infinitely fast and could handle data of any shape or size. In that world, there’d be no reason to transform data before loading it. You’d extract it and load it in its rawest form.

You’d still want to transform the data, because querying low-quality, dirty data isn’t likely to yield much business value. But your infinitely fast data warehouse could handle that transformation right at query time. The transformation and query would all be a single step. Think of it as just-in-time transformation. Or ELT.

The advantage of this imaginary system is clear: You wouldn’t have to decide ahead of time which data to discard or which version of your metric definitions to use. You’d always use the freshest version of your transformation logic, giving you total flexibility and agility.

So, is that the world we live in? And if so, should we switch to ELT?

Not quite. Data warehouses have indeed gotten several orders of magnitude faster and cheaper. Transformations that used to take hours and cost thousands of dollars now take seconds and cost pennies. But they can still get bogged down with misshapen data or huge processes.

So there’s still some transformation that’s best accomplished outside the warehouse. Removing irrelevant or dirty data, and doing heavyweight reshaping, is still often a preloading process. But this initial transform is a much smaller step and thus much less likely to need updating down the road.

Basically, it’s gone from a big, all-encompassing ‘T’ to a much smaller ‘t’

Once the initial transform is done, it’d be nice to move the rest of the transform to query time. But especially with larger data volumes, the data warehouses still aren’t quite fast enough to make that workable. (Plus, you still need a good way to manage the business logic and impose it as people query.)

So instead of moving all of that transformation to query time, more and more companies are doing most of it in the data warehouse—but they’re doing it immediately after loading. This gives them lots more agility than in the old system, but maintains tolerable performance. For now, at least, this is where the biggest “T” is happening.

The lightest-weight transformations—the ones the warehouses can do very quickly—are happening right at query time. This represents another small “t,” but it has a very different focus than the preloading “t.” That’s because these lightweight transformations often involve prototypes of new metrics and more ad hoc exploration, so the total flexibility that query-time transformation provides is ideal.
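As a rough sketch of this EtLT split, consider the following Python example, with SQLite standing in for a real data warehouse and the table, columns, and cleanup rules invented for illustration:

    import sqlite3

    raw_rows = [
        ("2017-10-01T09:13:00", "us", " 19.99"),
        ("2017-10-01T09:14:00", None, "bad"),   # dirty row
        ("2017-10-02T17:40:00", "de", "7.50"),
    ]

    def small_t(row):
        """Light preload cleanup: drop unparseable rows, fill missing countries."""
        ts, country, amount = row
        try:
            return (ts, (country or "unknown").strip(), float(amount))
        except ValueError:
            return None

    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders_raw (ts TEXT, country TEXT, amount REAL)")
    db.executemany(
        "INSERT INTO orders_raw VALUES (?, ?, ?)",
        [r for r in (small_t(row) for row in raw_rows) if r is not None],
    )

    # Big "T": business logic applied inside the warehouse, after loading.
    db.execute("""
        CREATE VIEW daily_revenue AS
        SELECT substr(ts, 1, 10) AS day, country, SUM(amount) AS revenue
        FROM orders_raw GROUP BY day, country
    """)

    # Small query-time "t": ad hoc reshaping on top of the modeled view.
    print(db.execute(
        "SELECT day, SUM(revenue) FROM daily_revenue GROUP BY day"
    ).fetchall())

Changing the business logic now mostly means redefining a view, not rebuilding the whole pipeline.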

In short, we’re seeing a huge shift that takes advantage of new technologies to make analytics more flexible, more responsive, and more performant. As a result, employees are making better decisions using data that was previously slow, inaccessible, or worst of all, wrong. And the companies that embrace this shift are outpacing rivals stuck in the old way of doing things.

ETL? ETL is dead. But long live … um … EtLTt?

Source: InfoWorld Big Data

Cisco Updates Its SDN Solution

Cisco has announced updates to its Application Centric Infrastructure (Cisco ACI™), a software-defined networking (SDN) solution designed to make it easier for customers to adopt and advance intent-based networking for their data centers. With the latest software release (ACI 3.0), more than 4,000 ACI customers can increase business agility with network automation, simplified management, and improved security for any combination of workloads in containers, virtual machines, and bare metal across private clouds and on-premises data centers.

The transitions occurring in the data center are substantial. Enterprises face an unrelenting need for greater speed, flexibility, security, and scale across increasingly complex data centers and multi-cloud environments.

“As our customers shift to multi-cloud strategies, they are seeking ways to simplify the management and scalability of their environments,” said Ish Limkakeng, senior vice president for data center networking at Cisco. “By automating basic IT operations with a central policy across multiple data centers and geographies, ACI’s new multi-site management capability helps network operators more easily move and manage workloads with a single pane of glass — a significant step in delivering on Cisco’s vision for enabling ACI Anywhere.”

The new ACI 3.0 software release is now available. New features include:

Multi-site Management: Customers can seamlessly connect and manage multiple ACI fabrics that are geographically distributed to improve availability by isolating fault domains, and provide a global view of network policy through a single management portal. This greatly simplifies disaster recovery and the ability to scale out applications.

Kubernetes Integration: Customers can deploy their workloads as microservices in containers, define ACI network policy for them through Kubernetes, and get unified networking constructs for containers, virtual machines, and bare metal. This brings to containers the same level of deep integration that ACI has had with numerous hypervisors.

Improved Operational Flexibility and Visibility: The new Next Gen ACI User Interface improves usability with consistent layouts, simplified topology views, and troubleshooting wizards. In addition, ACI now includes graceful insertion and removal, support for mixed operating systems and quota management, and latency measurements across fabric endpoints for troubleshooting.

Security: ACI 3.0 delivers new capabilities to protect networks by mitigating attacks such as IP/MAC spoofing with First Hop Security integration, automatically authenticating workloads in-band and placing them in trusted security groups, and supporting granular policy enforcement for endpoints within the same security group.

“With ‘ACI Anywhere,’ Cisco is delivering a scalable solution that will help position customers for success in multi-cloud and multi-site environments,” said Dan Conde, an analyst with Enterprise Strategy Group. “ACI’s new integration with container cluster managers and its enhancements to zero trust security make this a modern offering for the market, whether you are a large Service Provider, Enterprise, or a commercial customer.”

Source: CloudStrategyMag

UKCloud Launches Cloud GPU Services

UKCloud has announced the launch of its Cloud GPU computing service based on NVIDIA virtual GPU solutions with NVIDIA Tesla P100 and M60 GPUs (graphics processing units). The service will support computational and visualisation intensive workloads for UKCloud’s UK public sector and health care customers. UKCloud is not only the first Cloud Service Provider based in the UK or Europe to offer Cloud GPU computing services with NVIDIA GPUs, but is also the only provider specialising in public sector and health care and the specific needs of these customers.

“Building on the foundation of UKCloud’s secure, assured, UK-Sovereign platform, we are now able to offer a range of cloud-based compute, storage and GPU services to meet our customers’ complex workload requirements,” said Simon Hansford, CEO, UKCloud. “The public sector is driving more complex computational and visualisation intensive workloads than ever before, not only for CAD development packages, but also for tasks like the simulation of infrastructure changes in transport, for genetic sequencing in health or for battlefield simulation in defence. In response to this demand, we have a greater focus on emerging technologies such as deep learning, machine learning and artificial intelligence.”

Many of today’s modern applications, especially in fields such as medical imaging or graphical analytics, need an NVIDIA GPU to power them, whether they are running on a laptop or desktop, on a departmental server or on the cloud. Just as organisations are finding that their critical business applications can be run more securely and efficiently in the cloud, so too they are realising that it makes sense to host graphical and visualisation intensive workloads there as well.

Adding cloud GPU computing services utilising NVIDIA technology to support more complex computational and visualisation intensive workloads was a customer requirement captured via UKCloud Ideas, a service that was introduced as part of UKCloud’s maniacal focus on customer service excellence. UKCloud Ideas proactively polls its clients for ideas and wishes for service improvements, enabling customers to vote on ideas and drive product improvements across the service. This has facilitated more than 40 feature improvements in the last year across UKCloud’s service catalogue from changes to the customer portal to product specific improvements.

One comment came from a UKCloud partner with many clients needing GPU capability: “One of our applications includes 3D functionality which requires a graphics card. We have several customers who might be interested in a hosted solution but would require access to this functionality. To this end it would be helpful if UKCloud were able to offer us a solution which included a GPU.”

Listening to its clients in this way and acting on their suggestions to improve its service by implementing NVIDIA GPU technology was one of a number of initiatives that enabled UKCloud to win a 2017 UK Customer Experience Award for putting customers at the heart of everything, through the use of technology.

“The availability of NVIDIA GPUs in the cloud means businesses can capitalise on virtualisation without compromising the functionality and responsiveness of their critical applications,” added Bart Schneider, Senior Director of CSP EMEA at NVIDIA. “Even customers running graphically complex or compute-intensive applications can benefit from rapid turn-up, service elasticity and cloud-economics.”

UKCloud’s GPU-accelerated cloud service, branded as Cloud GPU, is available in two versions: Compute and Visualisation. Both are based on NVIDIA GPUs and initially available only on UKCloud’s Enterprise Compute Cloud platform. They will be made available on UKCloud’s other platforms at a later date. The two versions are as follows:

  • UKCloud’s Cloud GPU Compute: This is a GPU-accelerated computing service, based on the NVIDIA Tesla P100 GPU, that supports applications developed using NVIDIA CUDA, which enables parallel co-processing on both the CPU and GPU. Typical use cases include looking for cures, trends, and research findings in medicine, along with genomic sequencing, data mining and analytics in social engineering, trend identification and predictive analytics in business or financial modelling, and other applications of AI and deep learning. Available from today with all VM sizes, Cloud GPU Compute will represent an additional cost of £1.90 per GPU per hour on top of the cost of the VM.
  • UKCloud’s Cloud GPU Visualisation: This is a virtual GPU (vGPU) service, utilising the NVIDIA Tesla M60, which extends the power of NVIDIA GPU technology to virtual desktops and apps. In addition to powering remote workspaces, typical use cases include military training simulations and satellite image analysis in defence, medical imaging, and complex image rendering. Available from the end of October with all VM sizes, Cloud GPU Visualisation will represent an additional cost of £0.38 per vGPU per hour on top of the cost of the VM.

UKCloud has also received a top accolade from NVIDIA, that of ‘2017 Best Newcomer’ in the EMEA partner awards that were announced at NVIDIA’s October GPU Technology Conference 2017 in Munich. UKCloud was commended for making GPU technology more accessible for the UK public sector. As the first European Cloud Service Provider with NVIDIA GPU Accelerated Computing, UKCloud is helping to accelerate the adoption of Artificial Intelligence across all areas of the public sector, from central and local government to defence and healthcare, by allowing its customers and partners to harness the awesome power of GPU compute, without having to build specific rigs.

Source: CloudStrategyMag

Alibaba Cloud Joins Red Hat Certified Cloud And Service Provider Program

Red Hat, Inc. and Alibaba Cloud have announced that they will join forces to bring the power and flexibility of Red Hat’s open source solutions to Alibaba Cloud’s customers around the globe.

Alibaba Cloud is now part of the Red Hat Certified Cloud and Service Provider program, joining a group of technology industry leaders who offer Red Hat-tested and validated solutions that extend the functionality of Red Hat’s broad portfolio of open source cloud solutions. The partnership extends the reach of Red Hat’s offerings across the top public clouds globally, providing a scalable destination for cloud computing and reiterating Red Hat’s commitment to providing greater choice in the cloud.

“Our customers not only want greater performance, flexibility, security and portability for their cloud initiatives; they also want the freedom of choice for their heterogeneous infrastructures. They want to be able to deploy their technologies of choice on their scalable infrastructure of choice. That is Red Hat’s vision and the focus of the Red Hat Certified Cloud and Service Provider Program. By working with Alibaba Cloud, we’re helping to bring more choice and flexibility to customers as they deploy Red Hat’s open source solutions across their cloud environments,” said Mike Ferris, vice president, technical business development and business architecture, Red Hat.

In the coming months, Red Hat solutions will be available directly to Alibaba Cloud customers, enabling them to take advantage of the full value of Red Hat’s broad portfolio of open source cloud solutions. Alibaba Cloud intends to offer Red Hat Enterprise Linux in a pay-as-you-go model in the Alibaba Cloud Marketplace.

By joining the Red Hat Certified Cloud and Service Provider program, Alibaba Cloud has signified that it is a destination for Red Hat customers, independent software vendors (ISVs) and partners to enable them to benefit from Red Hat offerings in public clouds. These will be provided under innovative consumption and service models with the greater confidence that Red Hat product experts have validated the solutions.

“As enterprises in China, and throughout the world, look to modernize application environments, a full-lifecycle solution by Red Hat on Alibaba Cloud can provide customers higher flexibility and agility. We look forward to working with Red Hat to help enterprise customers with their journey of scaling workloads to Alibaba Cloud,” said Yeming Wang, deputy general manager of Alibaba Cloud Global, Alibaba Cloud.

Launched in 2009, the Red Hat Certified Cloud and Service Provider Program is designed to assemble the solutions cloud providers need to plan, build, manage, and offer hosted cloud solutions and Red Hat technologies to customers. The Certified Cloud Provider designation is awarded to Red Hat partners following validation by Red Hat. Each provider meets testing and certification requirements to demonstrate that they can deliver a safe, scalable, supported, and consistent environment for enterprise cloud deployments.

In addition, in the coming months, Red Hat customers will be able to move eligible, unused Red Hat subscriptions from their datacenter to Alibaba Cloud, China’s largest public cloud service provider, using Red Hat Cloud Access. Red Hat Cloud Access is an innovative “bring-your-own-subscription” benefit available from select Red Hat Certified Cloud and Service Providers that enables customers to move eligible Red Hat subscriptions from on-premise to public clouds. Red Hat Cloud Access also enables customers to maintain a direct relationship with Red Hat – including the ability to receive full support from Red Hat’s award-winning Global Support Services organization, enabling customers to maintain a consistent level of service across certified hybrid deployment infrastructures.

Source: CloudStrategyMag

EdgeConneX® Enables Cloudflare Video Streaming Service

EdgeConneX® has announced a new partnership with Cloudflare to enable and deploy its new Cloudflare Stream service. The massive Edge deployment will roll out in 18 Edge Data Centers® (EDCs) across North America and Europe, enabling Cloudflare to bring data within a few milliseconds of local market end users and providing fast and effective delivery of bandwidth-intensive content.

Cloudflare powers more than 10% of all Internet requests and ensures that web properties, APIs and applications run efficiently and stay online. On September 27, 2017, exactly seven years after the company’s launch, Cloudflare expanded its offerings with Cloudflare Stream, a new service that combines encoding and global delivery to form a solution for the technical and business issues associated with video streaming. By deploying Stream at all of Cloudflare’s edge nodes, Cloudflare is providing customers the ability to integrate high-quality, reliable streaming video into their applications.

In addition to the launch of Stream, Cloudflare is rolling out four additional new services: Unmetered Mitigation, which eliminates surge pricing for DDoS mitigation; Geo Key Manager, which provides customers with granular control over where they place their private keys; Cloudflare Warp, which eliminates the effort required to fully mask and protect an application; and Cloudflare Workers, which writes and deploys JavaScript code at the edge. As part of its ongoing global expansion, Cloudflare is launching with EdgeConneX to serve more customers with fast and reliable web services.

“We think video streaming will be a ubiquitous component within all websites and apps in the future, and it’s our immediate goal to expand the number of companies that are streaming video from 1,000 to 100,000,” explains Matthew Prince, co-founder and CEO, Cloudflare. “Combined with EdgeConneX’s portfolio of Edge Data Centers, our technology enables a global solution across all 118 of our points of presence, for the fastest and most secure delivery of video and Internet content.”

In order to effectively deploy its services, including the newly launched Stream solution, Cloudflare is allowing customers to run basic software at global facilities located at the Edge of the network. To achieve this, Cloudflare has selected EdgeConneX to provide fast and reliable content delivery to end users. When deploying Stream and other services in EDCs across North America and Europe, Cloudflare will utilize this massive Edge deployment to further enhance its service offerings.

Cloudflare’s performance gains from EdgeConneX EDCs have been verified by Cedexis, the leader in latency-based load balancing for content and cloud providers. Their panel of Real User Measurement data showed significant response time improvements immediately following the EdgeConneX EDC deployments — 33% in the Minneapolis metro area and 20% in the Portland metro area.

“When it comes to demonstrating the effectiveness of storing data at an EdgeConneX EDC, the numbers speak for themselves,” says Clint Heiden, chief commercial officer, EdgeConneX. “We look forward to continuing our work with Cloudflare to help them deliver a wide range of cutting-edge services to their customer base, including Cloudflare Stream.”

Source: CloudStrategyMag

IDG Contributor Network: AI and quantum computing: technology that's fueling innovation and solving future problems

Two weeks ago, I spent time in Orlando, Florida, attending Microsoft’s huge IT pro and developer conference known as Microsoft Ignite. Having the opportunity to attend events such as this to see the latest in technological advancements is one of the highlights of my job. Every year, I am amazed at what new technologies are being made available to us. The pace of innovation has increased exponentially over the last five years. I can only imagine what the youth of today will bring to this world as our next generation’s creators.

Microsoft’s CEO, Satya Nadella, kicked off the vision keynote on Day 1. As always, he gets the crowd pumped up with his inspirational speeches. If you saw Satya’s keynote last year, you could almost bet on what he was going to be talking about this year. His passion, and Microsoft’s mission, is to empower every person and every organization on the planet to achieve more. This is a bold statement, but one that I believe is possible. He also shared how Microsoft is combining cloud, artificial intelligence, and mixed reality across their product portfolio to help customers innovate and build the future of business. This was followed by a demonstration of how Ford Motor was able to use these technologies to improve product design and engineering and innovate at a much faster pace today. It’s clear to me that AI is going to be a core part of our lives as we continue to evolve with this technology.

The emergence of machine learning business models based on the cloud is in fact a big factor in why AI is taking off. Prior to the cloud, AI projects had high costs, but cloud economics have rendered certain machine learning capabilities relatively inexpensive and less complicated to operate. Thanks to the integration of cloud and AI, very specialized artificial intelligence startups are exploding in growth. Besides the aforementioned Microsoft, tech giants such as Facebook, Google, and Apple are also seeing their AI projects and initiatives explode.

As we move forward, the potential for these technologies to help people in ways that we have never been able to before is going to become more of a reality than a dream. Technologies such as AI, serverless computing, containers, augmented reality, and, yes, quantum computing will fundamentally change how we do things and fuel innovation at a pace faster than ever before.

One of the most exciting moments that had everyone’s attention at Ignite was when Microsoft shared what it has been doing around quantum computing. We’ve heard about this, but is it real? The answer is yes. Other influential companies such as IBM and Google are investing resources in this technology as well. It’s quite complex but very exciting. To see a technology like this come to fruition and make an impact in my lifetime would be nothing short of spectacular.

Moore’s Law states that the number of transistors on a microprocessor doubles roughly every 18 to 24 months. Today, traditional computers store data as binary digits represented by either a 1 or 0 to signify a state of on or off. With this model, we have come a long way from the early days of computing power, but there is still a need for even faster and more powerful processing. Intel is already working with 10-nanometer manufacturing process technology, code-named Cannon Lake, that will offer reduced power consumption, higher density, and increased performance. In the very near future, circuits will have to be measured on an atomic scale. This is where quantum computing comes in.

I’m not an expert in this field, but I have taken an interest in this technology as I have a background in electronics engineering. In simple terms, quantum computing harnesses the power of atoms and molecules to perform memory and processing tasks. Quantum computing combines the best of math, physics, and computer science, using what is referred to as electron fractionalization.

Quantum computers aren’t limited to only two states. They encode information using quantum bits, otherwise known as qubits. A qubit can store data as both 1s and 0s at the same time, a property known as superposition, which unlocks parallelism. That probably doesn’t tell you much, but think of it this way: this technology could enable us to solve complex problems in hours or days that would normally take billions of years with traditional computers. Think about that for a minute and you will realize just how significant this could be. It could enable researchers to develop and simulate new materials, improve medicines, accelerate AI, and solve world hunger and global warming. Quantum computing will help us solve the impossible.
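For intuition only, here is a toy NumPy sketch of a single simulated qubit; it is ordinary linear algebra on a laptop, not Microsoft’s quantum tooling, but it shows what superposition means numerically:

    import numpy as np

    zero = np.array([1, 0], dtype=complex)            # the |0> basis state
    hadamard = np.array([[1, 1], [1, -1]]) / np.sqrt(2)

    superposed = hadamard @ zero                      # equal mix of |0> and |1>
    probabilities = np.abs(superposed) ** 2           # Born rule: |amplitude|^2

    print(probabilities)  # [0.5 0.5] -> a measurement yields 0 or 1 with equal odds

A real quantum computer holds many such amplitudes in physical hardware at once, which is where the parallelism comes from.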

There are some inherent challenges with quantum computing. If you try to look at a qubit, you risk bumping it, thereby causing its value to change. Scientists have devised ways to observe these quantum superpositions without destroying them. This is done by using cryogenics to cool the quantum chips down to temperatures around 0.01 K (about –459.65°F), where there are no vibrations to interfere with measurements.

Soon, developers will be able to test algorithms by running them in a local simulator on their own computers, simulating around 30 qubits, or in Azure, simulating around 40 qubits. As companies such as Microsoft, Google, and IBM continue to develop technologies such as this, dreams of quantum computing are becoming a reality. This technological innovation is not about who is first to prove the value of quantum computing. It is about solving real-world problems for future generations in the hope of a better world.

Source: InfoWorld Big Data

IDG Contributor Network: The rise and predominance of Apache Spark

Initially open-sourced in 2010, with its first stable (1.0) release arriving in 2014, Apache Spark quickly became a prominent player in the big data space. Since then, its adoption by big data companies has been on the rise at an eye-catching rate.

In-memory processing

In-memory processing, undoubtedly a key feature of Spark, is what makes the technology deliver speed that dwarfs the performance of conventional big data processing. But in-memory processing isn’t a new computing concept, and there is a long list of database and data-processing products built around it; Redis and VoltDB are a couple of examples. Another example is Apache Ignite, which is also equipped with in-memory processing capability, supplemented by a WAL (write-ahead log), to address the performance of big data queries and ACID (atomicity, consistency, isolation, durability) transactions.

Evidently, the functionality of in-memory processing alone isn’t quite sufficient to differentiate a product from others. So, what makes Spark stand out from the rest in the highly competitive big data processing arena?

BI/OLAP at scale with speed

For starters, I believe Spark successfully captures a sweet spot that few other products do. The ever-growing demand for high-speed BI (business intelligence) analytics has, in a sense, started to blur the boundary between the OLAP (online analytical processing) and OLTP (online transaction processing) worlds.

On one hand, we have distributed computing platforms such as Hadoop providing a MapReduce programming model, in addition to its popular distributed file system (HDFS). While MapReduce is a great data processing methodology, it’s a batch process that doesn’t deliver results in a timely manner.

On the other hand, there are big data processing products addressing the needs of OLTP. Examples of products in this category include Phoenix on HBase, Apache Drill, and Ignite. Some of these products provide a query engine that emulates standard SQL’s transactional processing functionality to varying extents, applied to key-value or column-oriented databases.

What was missing, but in high demand, in the big data space was a product that does batch OLAP at scale with speed. There is indeed a handful of BI analytics/OLAP products, such as Apache Kylin and Presto, and some of them manage to fill the gap with some success in this very space. But it’s Spark that has demonstrated success in simultaneously addressing both speed and scale.

Nevertheless, Spark isn’t the only winner in the ‘speed + scale’ battle. Having emerged around the same time as Apache Spark, Impala (now an Apache incubator project) has also demonstrated remarkable performance in both speed and scale in its recent releases. Yet it has never achieved the same level of popularity as Spark. So something else in Spark must make it more appealing to contemporary software engineers.

Immutable data with functional programming

Apache Spark provides APIs for three types of dataset. RDDs (resilient distributed datasets) are immutable distributed collections of data that are manipulated through functional transformations (map, reduce, filter, and so on). DataFrames are immutable distributed collections of data in a table-like form with named columns, where each row is a generic untyped JVM object called Row. Datasets are collections of strongly typed JVM objects.

Regardless of the API you elect to use, data in Spark is immutable, and changes are applied to the data via compositional functional transformations. In a distributed computing environment, data immutability is highly desirable for concurrent access and performance at scale. In addition, this approach of formulating and resolving data processing problems in a functional programming style is favored by many software engineers and data scientists these days.

For MapReduce-style processing, Spark provides an API with implementations of map(), flatMap(), groupBy(), and reduce() in the manner of a classic functional programming language such as Scala. These methods can be applied to datasets in a compositional fashion as a sequence of data transformations, bypassing the need to code separate mapper and reducer modules as in conventional MapReduce.
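As an illustration of that compositional style, here is a minimal PySpark sketch (the input path and record layout are invented for the example):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

    counts = (
        spark.sparkContext.textFile("hdfs:///data/visits.txt")  # hypothetical input
        .flatMap(lambda line: line.split())       # one record per word
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)          # the classic MapReduce step, composed inline
        .filter(lambda pair: pair[1] > 10)
    )

Each call returns a new immutable dataset, so the whole pipeline reads as a single chain of transformations rather than separate mapper and reducer modules.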

Spark is “lazy”

An underlying design principle that plays a pivotal role in the operational performance of Spark is “laziness.” Spark is lazy in the sense that it holds off actual execution of transformations until it receives requests for resultant data to be returned to the driver program (i.e., the submitted application that is being serviced in an active execution context).

Such an execution strategy can significantly minimize disk and network I/O, enabling Spark to perform well at scale. For example, in a MapReduce process, rather than returning the high volume of data generated by map that is to be consumed by reduce, Spark may elect to return only the much smaller resultant data from reduce to the driver program.
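A small PySpark sketch of that behavior (the numbers are arbitrary): the transformations merely record what should happen, and only the final action triggers any work:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("lazy-sketch").getOrCreate()
    sc = spark.sparkContext

    rdd = sc.parallelize(range(1_000_000))
    squares = rdd.map(lambda x: x * x)            # nothing executes yet
    large = squares.filter(lambda x: x > 10_000)  # still nothing

    # The count() action triggers execution, and only a single number
    # travels back to the driver program.
    print(large.count())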

Cluster and programming language support

As a distributed computing framework, Spark needs robust cluster management functionality to scale out horizontally. Spark has been known for its effective use of the CPU cores available across thousands of server nodes. Besides its default standalone cluster mode, Spark also supports other cluster managers, including Hadoop YARN and Apache Mesos.

As for programming languages, Spark supports Scala, Java, Python, and R. Both Scala and R are functional programming languages at heart and have been increasingly adopted by the technology industry in general. Programming in Scala on Spark feels like home given that Spark itself is written in Scala, whereas R is primarily tailored for data science analytics.

Python, with its popular data science libraries like NumPy, is perhaps one of the fastest-growing programming languages, partly due to the increasing demand for data science work. Evidently, Spark’s Python API (PySpark) has been quickly adopted in volume by the big data community. Interoperable with NumPy, Spark’s machine learning library MLlib, built on top of its core engine, has helped fuel enthusiasm from the data science community.

On the other hand, Java hasn’t achieved the kind of success Python enjoys on Spark. Apparently the Java API on Spark feels like an afterthought. On a few occasions I’ve seen something rather straightforward in Scala require lengthy workaround code in Java on Spark.

Power of SQL and user-defined functions

SQL-compliant query capability is a significant part of Spark’s strength. Recent releases of the Spark API support the SQL:2003 standard. One of the most sought-after query features is window functions, which are not even available in some major SQL-based RDBMSs like MySQL. Window functions let you rank or aggregate rows of data over a sliding window of rows, which helps minimize expensive operations such as joins of DataFrames.

Another important feature of the Spark API is user-defined functions (UDFs), which let you create custom functions that apply the vast number of general-purpose functions available in the programming language to data columns. While there is only a handful of functions specific to the DataFrame API, with UDFs you can draw on virtually any method available in, say, the Scala programming language to assemble custom functions.
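A brief PySpark sketch of both features, with the table, columns, and threshold invented for illustration:

    from pyspark.sql import SparkSession, functions as F, types as T
    from pyspark.sql.window import Window

    spark = SparkSession.builder.appName("window-udf-sketch").getOrCreate()

    df = spark.createDataFrame(
        [("alice", "2017-10-01", 120), ("alice", "2017-10-02", 90),
         ("bob",   "2017-10-01", 200)],
        ["user", "day", "spend"],
    )

    # Window function: a running total per user, with no self-join required.
    w = Window.partitionBy("user").orderBy("day")
    df = df.withColumn("running_spend", F.sum("spend").over(w))

    # UDF: wrap arbitrary Python logic and apply it to a column.
    bucket = F.udf(lambda amount: "high" if amount >= 100 else "low", T.StringType())
    df = df.withColumn("bucket", bucket(F.col("spend")))

    df.show()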

Spark streaming

In the scenario where data streaming is a requirement on top of building an OLAP system, the necessary integration effort can be challenging. Such integration generally requires not only bringing in a third-party streaming library, but also making sure that the two disparate APIs can cooperatively and reliably bridge the vast difference in latency between near-real-time and batch processing.

Spark provides a streaming library that offers fault-tolerant distributed streaming functionality. It performs streaming by treating small contiguous chunks of data as a sequence of RDDs, Spark’s core data structure. The inherent streaming capability undoubtedly alleviates the burden of having to integrate high-latency batch processing tasks with low-latency streaming routines.
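For illustration, here is a minimal sketch using the classic DStream API the paragraph describes (the socket source and port are placeholders):

    from pyspark import SparkContext
    from pyspark.streaming import StreamingContext

    sc = SparkContext(appName="streaming-sketch")
    ssc = StreamingContext(sc, batchDuration=5)  # each 5-second chunk becomes an RDD

    lines = ssc.socketTextStream("localhost", 9999)
    counts = (
        lines.flatMap(lambda line: line.split())
        .map(lambda word: (word, 1))
        .reduceByKey(lambda a, b: a + b)
    )
    counts.pprint()  # print each micro-batch's result

    ssc.start()
    ssc.awaitTermination()

The same word-count logic written for batch RDDs carries over almost unchanged, which is exactly the integration burden the built-in library removes.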

Visualization, and beyond

Last but not least, Spark’s web-based visual tools reveal detailed information about how a data processing job is performed. Not only do the tools show you the breakdown of the tasks on individual worker nodes of the cluster, they also give details down to the life cycle of the individual execution processes (i.e., executors) allocated for the job. In addition, Spark’s visualization of complex job flows in the form of a DAG (directed acyclic graph) offers in-depth insight into how a job is executed. It’s especially useful in troubleshooting or performance-tuning an application.

So, it isn’t just one or two things among the long list of in-memory processing speed, scalability, addressing of the BI/OLAP niche, functional programming style, data immutability, lazy execution strategy, appeal to the rising data science community, robust SQL capability and task visualization, etc. that propel Apache Spark to be a predominant frontrunner in the big data space. It’s the collective strength of the complementary features that truly makes Spark stand out from the rest.

Source: InfoWorld Big Data

General Electric Names AWS Its Preferred Cloud Provider

Amazon Web Services, Inc. has announced that General Electric has selected AWS as its preferred cloud provider. GE continues to migrate thousands of core applications to AWS. GE began an enterprise-wide migration in 2014, and today many GE businesses, including GE Power, GE Aviation, GE Healthcare, GE Transportation, and GE Digital, run many of their cloud applications on AWS. Over the past few years, GE migrated more than 2,000 applications, several of which leverage AWS’s analytics and machine learning services.

“Adopting a cloud-first strategy with AWS is helping our IT teams get out of the business of building and running data centers and refocus our resources on innovation as we undergo one of the largest and most important transformations in GE’s history,” said Chris Drumgoole, chief technology officer and corporate vice president, General Electric. “We chose AWS as the preferred cloud provider for GE because AWS’s industry leading cloud services have allowed us to push the boundaries, think big, and deliver better outcomes for GE.”

“Enterprises across industries are migrating to AWS in droves, and in the process are discovering the wealth of new opportunities that open up when they have the most comprehensive menu of cloud capabilities — which is growing daily — at their fingertips,” said Mike Clayville, vice president, worldwide commercial sales, AWS. “GE has been at the forefront of cloud adoption, and we’ve been impressed with the pace, scope, and innovative approach they’ve taken in their journey to AWS. We are honored that GE has chosen AWS as their preferred cloud provider, and we’re looking forward to helping them as they continue their digital industrial transformation.”

Source: CloudStrategyMag