IDG Contributor Network: The rise and predominance of Apache Spark

Initially open-sourced in 2010 and followed by its first stable release four years later, Apache Spark quickly became a prominent player in the big data space. Since then, its adoption by big data companies has been on the rise at an eye-catching rate.

In-memory processing

Undoubtedly a key feature of Spark, in-memory processing is what makes the technology deliver the speed that dwarfs the performance of conventional big data processing. But in-memory processing isn’t a new computing concept, and there is a long list of database and data-processing products with an underlying design of in-memory processing. Redis and VoltDB are a couple of examples. Another example is Apache Ignite, which is also equipped with in-memory processing capability, supplemented by a WAL (write-ahead log), to address the performance of big data queries and ACID (atomicity, consistency, isolation, durability) transactions.

Evidently, the functionality of in-memory processing alone isn’t quite sufficient to differentiate a product from others. So, what makes Spark stand out from the rest in the highly competitive big data processing arena?

BI/OLAP at scale with speed

For starters, I believe Spark successfully captures a sweet spot that few other products do. The ever-growing demand for high-speed BI (business intelligence) analytics has, in a sense, started to blur the boundary between the OLAP (online analytical processing) and OLTP (online transaction processing) worlds.

On one hand, we have distributed computing platforms such as Hadoop, which provides a MapReduce programming model in addition to its popular distributed file system (HDFS). While MapReduce is a great data processing methodology, it’s a batch process that doesn’t deliver results in a timely manner.

On the other hand, there are big data processing products addressing the need for OLTP. Examples of products in this category include Phoenix on HBase, Apache Drill, and Ignite. Some of these products provide a query engine that emulates standard SQL’s transactional processing functionality, to varying extents, for key-value based or column-oriented databases.

What was missing but in high demand in the big data space was a product that does batch OLAP at scale with speed. There are indeed a handful of BI analytics/OLAP products, such as Apache Kylin and Presto, and some of them manage to fill the gap with some success. But it’s Spark that has demonstrated success in simultaneously addressing both speed and scale.

Nevertheless, Spark isn’t the only winner in the ‘speed + scale’ battle. Having emerged around the same time as Apache Spark, Impala (now an Apache incubator project) has also demonstrated remarkable performance in both speed and scale in its recent releases. Yet it has never achieved the same level of popularity as Spark. So something else in Spark must have made it more appealing to contemporary software engineers.

Immutable data with functional programming

Apache Spark provides APIs for three types of datasets: RDDs (resilient distributed datasets) are immutable distributed collections of data that can be manipulated using functional transformations (map, reduce, filter, etc.); DataFrames are immutable distributed collections of data in a table-like form with named columns, where each row is a generic untyped JVM object called Row; and Datasets are collections of strongly typed JVM objects.
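
As a minimal illustration, here is how the three abstractions relate in Scala, in the style of a spark-shell session (where a SparkSession named spark is predefined); the Person type and the sample rows are hypothetical:

```scala
import spark.implicits._

case class Person(name: String, age: Int)   // hypothetical record type

// RDD: an immutable distributed collection, manipulated with functional transformations
val rdd = spark.sparkContext.parallelize(Seq(("alice", 34), ("bob", 28)))

// DataFrame: table-like form with named columns; each row is a generic, untyped Row
val df = rdd.toDF("name", "age")

// Dataset: the same data viewed as strongly typed JVM objects
val ds = df.as[Person]
ds.filter(_.age > 30).show()
```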

Regardless of the API you elect to use, data in Spark is immutable, and changes are applied to the data via compositional functional transformations. In a distributed computing environment, data immutability is highly desirable for concurrent access and performance at scale. In addition, such an approach of formulating and solving data processing problems in a functional programming style has been favored by many software engineers and data scientists these days.

For MapReduce, Spark provides an API with implementations of map(), flatMap(), groupBy(), and reduce() in the style of classic functional programming languages such as Scala. These methods can be applied to datasets in a compositional fashion as a sequence of data transformations, bypassing the need to code separate modules of mappers and reducers as in conventional MapReduce.
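
The canonical word count reduces to a short chain of such transformations. A minimal sketch, again assuming a spark-shell session; the input path is hypothetical:

```scala
// Word count as composed transformations, with no separate mapper or reducer classes.
val counts = spark.sparkContext
  .textFile("data/input.txt")        // hypothetical input path
  .flatMap(_.split("\\s+"))          // "map" phase: emit individual words
  .map(word => (word, 1))
  .reduceByKey(_ + _)                // "reduce" phase: sum the counts per word

counts.take(10).foreach(println)
```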

Spark is “lazy”

An underlying design principle that plays a pivotal role in the operational performance of Spark is “laziness.” Spark is lazy in the sense that it holds off actual execution of transformations until it receives requests for resultant data to be returned to the driver program (i.e., the submitted application that is being serviced in an active execution context).

Such an execution strategy can significantly minimize disk and network I/O, enabling Spark to perform well at scale. For example, in a MapReduce process, rather than returning the high volume of data generated through map that is to be consumed by reduce, Spark may elect to return only the much smaller resultant data from reduce to the driver program.
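
A small sketch of the idea, assuming a spark-shell session and a hypothetical log file: the transformations below only record a lineage, nothing runs until the count() action, and only a single number travels back to the driver:

```scala
// Transformations merely build up a lineage graph; no data is read or processed yet.
val errors = spark.sparkContext
  .textFile("logs/huge.log")          // hypothetical path
  .map(_.toLowerCase)                 // deferred
  .filter(_.contains("error"))        // deferred

// The action triggers the whole pipeline; only a small Long returns to the driver,
// not the intermediate data.
val n = errors.count()
println(s"error lines: $n")
```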

Cluster and programming language support

As a distributed computing framework, robust cluster management functionality is essential for scaling out horizontally. Spark has been known for its effective use of available CPU cores across thousands of server nodes. Besides the default standalone cluster mode, Spark also supports other cluster managers, including Hadoop YARN and Apache Mesos.

As for programming languages, Spark supports Scala, Java, Python, and R. Both Scala and R are functional programming languages at heart and have been increasingly adopted by the technology industry in general. Programming in Scala on Spark feels like home, given that Spark itself is written in Scala, whereas R is primarily tailored for data science analytics.

Python, with its popular data science libraries like NumPy, is perhaps one of the fastest growing programming languages, partly due to the increasing demand in data science work. Evidently, Spark’s Python API (PySpark) has been quickly adopted in volume by the big data community. Interoperable with NumPy, Spark’s machine learning library MLlib, built on top of its core engine, has helped fuel enthusiasm from the data science community.

On the other hand, Java hasn’t achieved the kind of success Python enjoys on Spark. Apparently, the Java API on Spark feels like an afterthought. On a few occasions I’ve seen something rather straightforward in Scala require a lengthy workaround in Java on Spark.

Power of SQL and user-defined functions

SQL-compliant query capability is a significant part of Spark’s strength. Recent releases of the Spark API support the SQL:2003 standard. One of the most sought-after query features is window functions, which are not even available in some major SQL-based RDBMSs such as MySQL. Window functions enable one to rank or aggregate rows of data over a sliding window of rows, which helps minimize expensive operations such as joins of DataFrames.
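
To illustrate, here is a sketch in Scala over hypothetical sales data; the ranking and the running total are each computed within partitions of a single DataFrame, with no join involved:

```scala
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
import spark.implicits._

// Hypothetical data: (region, month, revenue)
val sales = Seq(
  ("east", "2017-01", 100.0), ("east", "2017-02", 120.0),
  ("west", "2017-01",  90.0), ("west", "2017-02",  80.0)
).toDF("region", "month", "revenue")

// Rank months by revenue within each region ...
val byRevenue = Window.partitionBy("region").orderBy($"revenue".desc)
// ... and keep a running total over a sliding window of rows.
val running = Window.partitionBy("region").orderBy("month")
  .rowsBetween(Window.unboundedPreceding, Window.currentRow)

sales.withColumn("rank", rank().over(byRevenue))
  .withColumn("running_total", sum($"revenue").over(running))
  .show()
```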

Another important feature of the Spark API is user-defined functions (UDFs), which allow one to create custom functions that apply the vast body of general-purpose functions available in the programming language to data columns. While there is only a handful of functions specific to the DataFrame API, with UDFs one can use virtually any method available in, say, the Scala programming language to assemble custom functions.
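
For example, a UDF can wrap ordinary Scala string methods, here a hypothetical normalizer for free-form text, and apply them to a column:

```scala
import org.apache.spark.sql.functions.udf
import spark.implicits._

// Any plain Scala logic can be packaged as a column-level function.
val normalize = udf { (s: String) =>
  Option(s).map(_.trim.toLowerCase.replaceAll("\\s+", " ")).orNull
}

// Applied to a hypothetical DataFrame with a free-form "city" column.
val cities = Seq("  New  York ", "san FRANCISCO").toDF("city")
cities.withColumn("city_norm", normalize($"city")).show(false)
```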

Spark streaming

In the scenario that data streaming is a requirement on top of building an OLAP system, the necessary integration effort can be challenging. Such integration generally requires not only bringing in a third-party streaming library, but also making sure that the two disparate APIs will cooperatively and reliably work out the vast difference in latency between near-real-time and batch processing.

Spark provides a streaming library that offers fault-tolerant distributed streaming functionality. It performs streaming by treating small contiguous chunks of data as a sequence of RDDs, Spark’s core data structure. The inherent streaming capability undoubtedly alleviates the burden of having to integrate high-latency batch processing tasks with low-latency streaming routines.
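
A minimal sketch of this streaming facility, counting words arriving on a socket in 10-second micro-batches (the host and port are placeholders):

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("StreamSketch").setMaster("local[2]")
val ssc  = new StreamingContext(conf, Seconds(10))

// Under the hood, each 10-second chunk of the stream arrives as an RDD.
val lines  = ssc.socketTextStream("localhost", 9999)   // placeholder host/port
val counts = lines.flatMap(_.split("\\s+"))
                  .map(word => (word, 1))
                  .reduceByKey(_ + _)
counts.print()

ssc.start()
ssc.awaitTermination()
```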

Visualization, and beyond

Last but not least, Spark’s web-based visual tools reveal detailed information about how a data processing job is performed. Not only do the tools show you the breakdown of the tasks on individual worker nodes of the cluster, they also give details down to the life cycle of the individual execution processes (i.e., executors) allocated for the job. In addition, Spark’s visualization of a complex job flow in the form of a DAG (directed acyclic graph) offers in-depth insight into how a job is executed. It’s especially useful in troubleshooting or performance-tuning an application.

So it isn’t just one or two things from the long list of in-memory processing speed, scalability, addressing of the BI/OLAP niche, functional programming style, data immutability, lazy execution strategy, appeal to the rising data science community, robust SQL capability, and task visualization that propel Apache Spark to be a predominant frontrunner in the big data space. It’s the collective strength of these complementary features that truly makes Spark stand out from the rest.

Source: InfoWorld Big Data

General Electric Names AWS Its Preferred Cloud Provider

Amazon Web Services, Inc. has announced that General Electric has selected AWS as its preferred cloud provider. GE continues to migrate thousands of core applications to AWS. GE began an enterprise-wide migration in 2014, and today many GE businesses, including GE Power, GE Aviation, GE Healthcare, GE Transportation, and GE Digital, run many of their cloud applications on AWS. Over the past few years, GE migrated more than 2,000 applications, several of which leverage AWS’s analytics and machine learning services.

“Adopting a cloud-first strategy with AWS is helping our IT teams get out of the business of building and running data centers and refocus our resources on innovation as we undergo one of the largest and most important transformations in GE’s history,” said Chris Drumgoole, chief technology officer and corporate vice president, General Electric. “We chose AWS as the preferred cloud provider for GE because AWS’s industry leading cloud services have allowed us to push the boundaries, think big, and deliver better outcomes for GE.”

“Enterprises across industries are migrating to AWS in droves, and in the process are discovering the wealth of new opportunities that open up when they have the most comprehensive menu of cloud capabilities — which is growing daily — at their fingertips,” said Mike Clayville, vice president, worldwide commercial sales, AWS. “GE has been at the forefront of cloud adoption, and we’ve been impressed with the pace, scope, and innovative approach they’ve taken in their journey to AWS. We are honored that GE has chosen AWS as their preferred cloud provider, and we’re looking forward to helping them as they continue their digital industrial transformation.”

Source: CloudStrategyMag

11 open source tools to make the most of machine learning

Venerable Shogun was created in 1999 and written in C++, but can be used with Java, Python, C#, Ruby, R, Lua, Octave, and Matlab. The latest version, 6.0.0, adds native support for Microsoft Windows and the Scala language.

Though popular and wide-ranging, Shogun has competition. Another C++-based machine learning library, Mlpack, has been around only since 2011, but professes to be faster and easier to work with (by way of a more integrated API set) than competing libraries.

Project: Shogun
GitHub: https://github.com/shogun-toolbox/shogun

Source: InfoWorld Big Data

MapR Converged Data Platform Now Available in Oracle Cloud Marketplace

MapR Technologies, Inc. has announced that the MapR Converged Data Platform is now available in the Oracle Cloud Marketplace, bringing a modern data system to Oracle Cloud customers. A silver level member of the Oracle PartnerNetwork (OPN), MapR enables a unified operational and analytic data service for Oracle Cloud customers.

“In today’s rapidly evolving technology landscape, organizations moving to the cloud are looking for increased flexibility and the quickest time to value,” said Sanjay Sinha, vice president cloud platform products, Oracle. “With Oracle Cloud, MapR can quickly and efficiently address the growing needs of its customers with support for Oracle’s state-of-the-art cloud platform.”

The MapR Converged Data Platform enables global access to a wide variety of data sources, including big data workloads such as Apache Hadoop and Apache Spark, POSIX-compliant file systems, NFS-enabled file systems, multi-model databases, and streaming data. The MapR Platform enables customers to collect data in the Oracle Cloud as well as from any number of other data sources – on-premises, hybrid cloud, and even other public clouds – supporting analytics, deep learning, machine learning, artificial intelligence, and edge computing.

“The MapR Platform is designed to better leverage public clouds with the recent introduction of MapR Orbit Cloud Suite. Joint customers will be able to run big data workloads on Oracle Cloud Infrastructure with cloud-native operations and cloud storage integration,” said Tom Fisher, CTO, MapR. “Our participation in the Oracle Cloud Marketplace further extends our commitment to the Oracle community and enables customers to easily reap the benefits of harnessing value from all of their data. We look forward to leveraging the power of the Oracle Cloud to help us achieve our business goals.”

The Oracle Cloud Marketplace is a one-stop shop for Oracle customers seeking trusted business applications and service providers offering unique business solutions, including ones that extend Oracle Cloud Applications. Oracle Cloud is the industry’s broadest and most complete public cloud, delivering enterprise-grade services at every level of the cloud technology stack including software as a service (SaaS), platform as a service (PaaS), infrastructure as a service (IaaS), and data as a service (DaaS).

Source: CloudStrategyMag