Spark tutorial: Get started with Apache Spark

November 27, 2017 by Ian Pointer Posted in Industry Insights & News

Apache Spark has become the de facto standard for processing data at scale, whether for querying large datasets, training machine learning models to predict future trends, or processing streaming data. In this article, we’ll show you how to use Apache Spark to analyze data in both Python and Spark SQL. We’ll then extend our code to support Structured Streaming, the current state of the art for handling streaming data within the platform. We’ll be using Apache Spark 2.2.0 here, but the code in this tutorial should also work on Spark 2.1.0 and above.

How to run Apache Spark

Before we begin, we’ll need an Apache Spark installation. You can run Spark in a number of ways. If you’re already running a Hortonworks, Cloudera, or MapR cluster, then you might have Spark installed already, or you can install it easily through Ambari, Cloudera Manager, or the MapR custom packages.

If you don’t have such a cluster at your fingertips, then Amazon EMR or Google Cloud Dataproc are both easy ways to get started. These cloud services allow you to spin up a Hadoop cluster with Apache Spark installed and ready to go. You’ll be billed for compute resources with an extra fee for the managed service. Remember to shut the clusters down when you’re not using them!

Of course, you could instead download the latest release from spark.apache.org and run it on your own laptop. You will need a Java 8 runtime installed (Java 7 will work, but is deprecated). Although you won’t have the compute power of a cluster, you will be able to run the code snippets in this tutorial.

Source: InfoWorld Big Data

What is Apache Spark? The big data analytics platform explained

November 13, 2017 by Ian Pointer Posted in Industry Insights & News

From its humble beginnings in the AMPLab at U.C. Berkeley in 2009, Apache Spark has become one of the key big data distributed processing frameworks in the world. Spark can be deployed in a variety of ways, provides native bindings for the Java, Scala, Python, and R programming languages, and supports SQL, streaming data, machine learning, and graph processing. You’ll find it used by banks, telecommunications companies, games companies, governments, and all of the major tech giants such as Apple, Facebook, IBM, and Microsoft.

Out of the box, Spark can run in a standalone cluster mode that simply requires the Apache Spark framework and a JVM on each machine in your cluster. However, it’s more likely you’ll want to take advantage of a resource or cluster management system to take care of allocating workers on demand for you. In the enterprise, this will normally mean running on Hadoop YARN (this is how the Cloudera and Hortonworks distributions run Spark jobs), but Apache Spark can also run on Apache Mesos, while work is progressing on adding native support for Kubernetes.

If you’re after a managed solution, then Apache Spark can be found as part of Amazon EMR, Google Cloud Dataproc, and Microsoft Azure HDInsight. Databricks, the company that employs the founders of Apache Spark, also offers the Databricks Unified Analytics Platform, which is a comprehensive managed service that offers Apache Spark clusters, streaming support, integrated web-based notebook development, and optimized cloud I/O performance over a standard Apache Spark distribution.

Spark vs. Hadoop

It’s worth pointing out that Apache Spark vs. Apache Hadoop is a bit of a misnomer. You’ll find Spark included in most Hadoop distributions these days. But due to two big advantages, Spark has become the framework of choice when processing big data, overtaking the old MapReduce paradigm that brought Hadoop to prominence.

The first advantage is speed. Spark’s in-memory data engine means that it can perform tasks up to one hundred times faster than MapReduce in certain situations, particularly when compared with multi-stage jobs that require the writing of state back out to disk between stages. Even Apache Spark jobs where the data cannot be completely contained within memory tend to be around 10 times faster than their MapReduce counterparts.

The second advantage is the developer-friendly Spark API. As important as Spark’s speed-up is, one could argue that the friendliness of the Spark API is even more important.

Spark Core

In comparison to MapReduce and other Apache Hadoop components, the Apache Spark API is very friendly to developers, hiding much of the complexity of a distributed processing engine behind simple method calls. The canonical example of this is how almost 50 lines of MapReduce code to count words in a document can be reduced to just a few lines of Apache Spark (here shown in Scala):

val textFile = sparkSession.sparkContext.textFile("hdfs:///tmp/words")
val counts = textFile.flatMap(line => line.split(" "))
                     .map(word => (word, 1))
                     .reduceByKey(_ + _)
counts.saveAsTextFile("hdfs:///tmp/words_agg")

By providing bindings to popular languages for data analysis like Python and R, as well as the more enterprise-friendly Java and Scala, Apache Spark allows everybody from application developers to data scientists to harness its scalability and speed in an accessible manner.
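To make the three steps in the word count easier to follow, here is a minimal sketch of the same flatMap, map, and reduceByKey shape in plain Python. It does not use Spark at all; the function name word_count and the sample lines are made up for illustration, and everything runs on a single machine.

```python
from collections import defaultdict
from itertools import chain


def word_count(lines):
    """Count words using the same flatMap -> map -> reduceByKey shape."""
    # flatMap: split every line into words and flatten into one sequence
    words = chain.from_iterable(line.split(" ") for line in lines)
    # map: pair each word with an initial count of 1
    pairs = ((word, 1) for word in words)
    # reduceByKey: add up the counts that share the same word
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return dict(counts)


# Hypothetical input standing in for the lines of a text file
sample_lines = ["to be or not to be", "to spark or not"]
counts = word_count(sample_lines)
print(counts)
```

In Spark the same three steps run partition by partition across the cluster; the sketch only shows the logic each step applies.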

Spark RDD

At the heart of Apache Spark is the concept of the Resilient Distributed Dataset (RDD), a programming abstraction that represents an immutable collection of objects that can be split across a computing cluster. Operations on the RDDs can also be split across the cluster and executed in a parallel batch process, leading to fast and scalable parallel processing.

RDDs can be created from simple text files, SQL databases, NoSQL stores (such as Cassandra and MongoDB), Amazon S3 buckets, and much more besides. Much of the Spark Core API is built on this RDD concept, enabling traditional map and reduce functionality, but also providing built-in support for joining data sets, filtering, sampling, and aggregation.

Spark runs in a distributed fashion by combining a driver process, which splits a Spark application into tasks, with many executor processes that receive those tasks and do the work. These executors can be scaled up and down as required for the application’s needs.
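The driver and executor split can be imitated on one machine with a thread pool. The sketch below is an analogy only, not how Spark schedules work: the function names, the sum-of-squares task, and the default of four workers are all assumptions chosen for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_partitions(data, num_partitions):
    """Driver side: slice the input into roughly equal partitions."""
    size = max(1, -(-len(data) // num_partitions))  # ceiling division, at least 1
    return [data[i:i + size] for i in range(0, len(data), size)]


def sum_of_squares_task(partition):
    """Executor side: the work done on a single partition."""
    return sum(x * x for x in partition)


def run_job(data, num_executors=4):
    partitions = split_into_partitions(data, num_executors)
    # The pool stands in for the executors; map() hands out one task per partition.
    with ThreadPoolExecutor(max_workers=num_executors) as pool:
        partial_results = list(pool.map(sum_of_squares_task, partitions))
    # The driver combines the partial results into the final answer.
    return sum(partial_results)


total = run_job(list(range(1, 11)))
print(total)
```

Changing num_executors changes how the data is sliced but not the result, which is the property that lets the number of executors be scaled to the application’s needs.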

Spark SQL

Originally known as Shark, Spark SQL has become more and more important to the Apache Spark project. It is likely the interface most commonly used by today’s developers when creating applications. Spark SQL is focused on the processing of structured data, using a dataframe approach borrowed from R and Python (in Pandas). But as the name suggests, Spark SQL also provides a SQL2003-compliant interface for querying data, bringing the power of Apache Spark to analysts as well as developers.

Alongside standard SQL support, Spark SQL provides a standard interface for reading from and writing to other datastores including JSON, HDFS, Apache Hive, JDBC, Apache ORC, and Apache Parquet, all of which are supported out of the box. Other popular stores—Apache Cassandra, MongoDB, Apache HBase, and many others—can be used by pulling in separate connectors from the Spark Packages ecosystem.

Selecting some columns from a dataframe is as simple as this line:

citiesDF.select("name", "pop")

Using the SQL interface, we register the dataframe as a temporary table, after which we can issue SQL queries against it:

citiesDF.createOrReplaceTempView("cities")
spark.sql("SELECT name, pop FROM cities")
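For readers without a Spark installation to hand, the idea of naming a dataset as a view and then querying it with SQL can be tried with Python’s built-in sqlite3 module. This is a stand-in SQL engine, not Spark SQL; the table name cities_data and the rows are made up, and SQLite’s view semantics are used only as an analogy.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in SQL engine.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cities_data (name TEXT, pop INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO cities_data VALUES (?, ?, ?)",
    [("Alpha", 1200, "X"), ("Beta", 800, "Y")],  # made-up rows
)

# The view gives a query a name; it does not copy any rows.
conn.execute("CREATE TEMP VIEW cities AS SELECT * FROM cities_data")

# A row added after the view was defined is still visible through it.
conn.execute("INSERT INTO cities_data VALUES ('Gamma', 500, 'X')")

# The same statement as the Spark SQL example, plus an ORDER BY for stable output.
rows = conn.execute("SELECT name, pop FROM cities ORDER BY name").fetchall()
print(rows)
conn.close()
```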

Behind the scenes, Apache Spark uses a query optimizer called Catalyst that examines data and queries in order to produce an efficient query plan for data locality and computation that will perform the required calculations across the cluster. In the Apache Spark 2.x era, the Spark SQL interface of dataframes and datasets (essentially a typed dataframe that can be checked at compile time for correctness and take advantage of further memory and compute optimizations at run time) is the recommended approach for development. The RDD interface is still available, but is recommended only if you have needs that cannot be encapsulated within the Spark SQL paradigm.

Spark MLlib

Apache Spark also bundles libraries for applying machine learning and graph analysis techniques to data at scale. Spark MLlib includes a framework for creating machine learning pipelines, allowing for easy implementation of feature extraction, selection, and transformation on any structured dataset. MLlib comes with distributed implementations of clustering and classification algorithms such as k-means clustering and random forests that can be swapped in and out of custom pipelines with ease. Models can be trained by data scientists in Apache Spark using R or Python, saved using MLlib, and then imported into a Java-based or Scala-based pipeline for production use.

Note that while Spark MLlib covers basic machine learning including classification, regression, clustering, and filtering, it does not include facilities for modeling and training deep neural networks (for details see InfoWorld’s Spark MLlib review). However, Deep Learning Pipelines are in the works.

Spark GraphX

Spark GraphX comes with a selection of distributed algorithms for processing graph structures including an implementation of Google’s PageRank. These algorithms use Spark Core’s RDD approach to modeling data; the GraphFrames package allows you to do graph operations on dataframes, including taking advantage of the Catalyst optimizer for graph queries.
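For a sense of what PageRank computes, here is a small single-machine sketch of the textbook iterative algorithm. It is not GraphX’s implementation; the damping factor of 0.85, the fixed iteration count, the handling of nodes without outgoing links, and the four-node graph are all assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """links maps each node to the list of nodes it links to."""
    nodes = set(links) | {dst for targets in links.values() for dst in targets}
    n = len(nodes)
    ranks = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_ranks = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            targets = links.get(node, [])
            if targets:
                # Split this node's damped rank evenly among its outgoing links.
                share = damping * ranks[node] / len(targets)
                for dst in targets:
                    new_ranks[dst] += share
            else:
                # A node with no outgoing links spreads its rank over all nodes.
                share = damping * ranks[node] / n
                for dst in nodes:
                    new_ranks[dst] += share
        ranks = new_ranks
    return ranks


# Hypothetical graph: "c" is linked to by three of the four nodes.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(graph)
print(sorted(ranks.items(), key=lambda item: item[1], reverse=True))
```

A distributed version performs the same per-node update, but sends the rank shares between machines rather than writing into a local dictionary.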

Spark Streaming

Spark Streaming was an early addition to Apache Spark that helped it gain traction in environments that required real-time or near real-time processing. Previously, batch and stream processing in the world of Apache Hadoop were separate things. You would write MapReduce code for your batch processing needs and use something like Apache Storm for your real-time streaming requirements. This obviously leads to disparate codebases that need to be kept in sync for the application domain despite being based on completely different frameworks, requiring different resources, and involving different operational concerns for running them.

Spark Streaming extended the Apache Spark concept of batch processing into streaming by breaking the stream down into a continuous series of microbatches, which could then be manipulated using the Apache Spark API. In this way, batch and streaming operations can share (mostly) the same code, running on the same framework, thus reducing both developer and operator overhead. Everybody wins.
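The microbatching idea can be sketched in a few lines of plain Python: cut an incoming stream into small lists and run ordinary batch logic on each one. This is a simplification under stated assumptions. Batches here are cut by record count rather than by time interval, and the names microbatches and count_errors, along with the sample log lines, are hypothetical.

```python
from itertools import islice


def microbatches(stream, batch_size):
    """Cut an iterator into consecutive lists of at most batch_size items."""
    iterator = iter(stream)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch


def count_errors(records):
    """Ordinary batch logic: works on any finite list of records."""
    return sum(1 for record in records if record.startswith("ERROR"))


log_lines = ["INFO start", "ERROR disk", "INFO ok", "ERROR net", "ERROR cpu"]

# Batch mode: run the function once over everything.
batch_total = count_errors(log_lines)

# Streaming mode: run the same function on each microbatch as it arrives.
per_batch = [count_errors(batch) for batch in microbatches(iter(log_lines), 2)]

print(batch_total, per_batch)
```

The point of the sketch is that count_errors is written once and reused unchanged in both modes.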

A criticism of the Spark Streaming approach is that microbatching, in scenarios where a low-latency response to incoming data is required, may not be able to match the performance of other streaming-capable frameworks like Apache Storm, Apache Flink, and Apache Apex, all of which use a pure streaming method rather than microbatches.

Structured Streaming

Structured Streaming (added in Spark 2.x) is to Spark Streaming what Spark SQL was to the Spark Core APIs: A higher-level API and easier abstraction for writing applications. In the case of Structured Streaming, the higher-level API essentially allows developers to create infinite streaming dataframes and datasets. It also solves some very real pain points that users have struggled with in the earlier framework, especially concerning dealing with event-time aggregations and late delivery of messages. All queries on structured streams go through the Catalyst query optimizer, and can even be run in an interactive manner, allowing users to perform SQL queries against live streaming data.
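To illustrate what event-time aggregation with late messages involves, the sketch below groups events into fixed windows by the time they occurred rather than the time they arrived, and drops events that arrive after a lateness cutoff. It is a deliberately simplified model, not Spark’s implementation: the function names, the integer timestamps, the window size, and the allowed_lateness parameter are all assumptions.

```python
from collections import defaultdict


def window_start(event_time, window_size):
    """Start of the fixed window that contains event_time."""
    return event_time - (event_time % window_size)


def windowed_counts(events, window_size=10, allowed_lateness=5):
    """events: iterable of (event_time, key) pairs in arrival order."""
    counts = defaultdict(int)
    dropped = []
    max_event_time = 0
    for event_time, key in events:
        # The cutoff trails the newest event time seen so far.
        max_event_time = max(max_event_time, event_time)
        cutoff = max_event_time - allowed_lateness
        start = window_start(event_time, window_size)
        if start + window_size <= cutoff:
            # The window this event belongs to has already closed: too late.
            dropped.append((event_time, key))
            continue
        counts[(start, key)] += 1
    return dict(counts), dropped


# Arrival order differs from event order: 8 arrives after 12, and 3 after 27.
arrivals = [(1, "click"), (4, "click"), (12, "click"),
            (8, "click"), (27, "click"), (3, "click")]
counts, dropped = windowed_counts(arrivals)
print(counts)
print(dropped)
```

The event at time 8 arrives out of order but is still counted in its original window, while the event at time 3 arrives after its window has closed and is dropped.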

Structured Streaming is still a rather new part of Apache Spark, having been marked as production-ready in the Spark 2.2 release. However, Structured Streaming is the future of streaming applications with the platform, so if you’re building a new streaming application, you should use Structured Streaming. The legacy Spark Streaming APIs will continue to be supported, but the project recommends porting over to Structured Streaming, as the new method makes writing and maintaining streaming code a lot more bearable.

What’s next for Apache Spark?

While Structured Streaming provides high-level improvements to Spark Streaming, it currently relies on the same microbatching scheme of handling streaming data. However, the Apache Spark team is working to bring continuous streaming without microbatching to the platform, which should solve many of the problems with handling low-latency responses (they’re claiming ~1ms, which would be very impressive). Even better, because Structured Streaming is built on top of the Spark SQL engine, taking advantage of this new streaming technique will require no code changes.

In addition to improving streaming performance, Apache Spark will be adding support for deep learning via Deep Learning Pipelines. Using the existing pipeline structure of MLlib, you will be able to construct classifiers in just a few lines of code, as well as apply custom TensorFlow graphs or Keras models to incoming data. These graphs and models can even be registered as custom Spark SQL UDFs (user-defined functions) so that the deep learning models can be applied to data as part of SQL statements.

Neither of these features is anywhere near production-ready at the moment, but given the rapid pace of development we’ve seen in Apache Spark in the past, they should be ready for prime time in 2018.

Source: InfoWorld Big Data

Had it with Apache Storm? Heron swoops to the rescue

June 2, 2016 by Ian Pointer Posted in Industry Insights & News

Last year, Twitter dropped two bombshells. First, it would no longer use Apache Storm in production. Second, it had replaced it with a homegrown data processing system, Heron.

Despite releasing a paper detailing the architecture of Heron, Twitter’s alternative to Storm remained hidden in Twitter’s data centers. That all changed last week when Twitter released Heron under an open source license. So what is Heron, and where does it fit in the world of data processing at scale?

A directed acyclic graph (DAG) data processing engine, Heron is another entry in a very crowded field right now. But Heron is not a “look, me too!” solution or an attempt to turn DAG engines into big data’s equivalent of FizzBuzz.

Heron grew out of real concerns Twitter was having with its large deployment of Storm topologies. These included difficulties with profiling and reasoning about Storm workers when scaled at the data level and at a topology level, the static nature of resource allocation in comparison to a system that runs on Mesos or YARN, lack of back-pressure support, and more.

Although Twitter could have adopted Apache Spark or Apache Flink, that would have involved rewriting all of Twitter’s existing code. (Don’t forget, Twitter has used Storm longer than anybody else, acquiring BackType, Storm’s creator, back in 2011 before it was open source.) Instead, Twitter took a different approach: a new stream processing framework with a Storm-compatible API.

At this point in our walk through a new framework, I’d normally go through some examples to show you what coding in the framework feels like, but there’s little point with Heron — you write Storm bolts and tuples in exactly the same manner as you would with Storm. All you need to do to run your Storm code on Heron is to add this section to your pom.xml’s dependencies:

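<dependency>
  <groupId>com.twitter.heron</groupId>
  <artifactId>heron-api</artifactId>
  <version>SNAPSHOT</version>
  <scope>compile</scope>
</dependency>
<dependency>
  <groupId>com.twitter.heron</groupId>
  <artifactId>heron-storm</artifactId>
  <version>SNAPSHOT</version>
  <scope>compile</scope>
</dependency>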

Then you remove your storm-core and clojure-plugin dependencies. Recompile, and your code will run on Heron with no further changes necessary. Simple! (Mostly, anyhow, but we’ll come back to that.)

Operationally, Heron’s current implementation runs on top of Apache Mesos, using Apache Aurora, the Mesos scheduling framework developed by Twitter (surprise!). Since switching all its Storm topologies over to Heron, Twitter managed to reduce hardware resources dedicated to the topologies by a factor of three while increasing throughput and reducing latency in processing — not bad.

Perhaps one of the most interesting aspects of Heron is that while code for it will be written in Java (or Scala), and the web-based UI components are written in Python, the critical parts of the framework (the code that manages the topologies and network communications) are not written in a JVM language at all.

Indeed, at the heart of Heron, you’ll find code in a language you might not expect: C++. I think this is an aspect of the big data world that we’ll see more of in the years to come.

The Apache Storm maintainers have removed many elements of its original Clojure code in favor of Java reimplementations, and the Apache Spark project currently generates Java code on-the-fly to speed up its DataFrame processing. But both are still tied to the JVM — and the JVM has problems at scale. Don’t get me wrong, the JVM is an amazing creation that has stood the test of time for 20 years, but when running on machines with huge amounts of RAM and processing tremendous amounts of data, problems with garbage collection emerge, no matter what fancy collector scheme you use.

At which point, moving back to a language like C++ starts to look appealing. As an example, Scylla, a C++ reimplementation of Apache Cassandra, has 10 times the throughput of Cassandra with none of the GC pauses that Cassandra is notorious for at large deployments. I’m fairly confident we’ll see Heron’s approach spread to other frameworks soon. This may be helped by Project Panama’s attempt to improve the interface between Java and other languages.

Given that Heron requires fewer resources and provides more throughput and less latency than Apache Storm, you should move all your topologies over to Heron right now, yes? Well, maybe. Heron is currently tied to Mesos, so if you don’t have existing Mesos infrastructure, you’ll need to set that up as well, which is no small undertaking. Also, if you’re making use of Storm’s DRPC features, they’re deprecated in Heron.

On the plus side, Heron has been running all of Twitter’s processing needs in production for more than a year, so it should be able to handle anything you can throw at it. Plus, Twitter points out that Heron is used at Microsoft and other Fortune 500 companies, so you can be relatively confident it’s going to stick around.

On the other hand, Storm hasn’t been standing still. The Apache Storm team might quibble with Twitter’s description of Heron as the “next generation of Apache Storm.” While Twitter was working on Heron, Apache Storm reached 1.0 — which includes support for back pressure, improved debugging and profiling options, a 60 percent decrease in latency, and up to a 16-fold speed improvement.

In addition, Storm 1.0 adds pacemaker, a daemon for offloading heartbeat traffic from ZooKeeper, freeing larger topologies from the infamous ZooKeeper bottleneck. Heron’s speed improvements are measured from the Storm 0.8.x code it diverged from, not the current version; if you have migrated over to Storm 1.0 already, you might not see much more improvement over your current Storm topologies, and you may run into incompatibilities between Storm’s and Heron’s implementations of new features like back-pressure support.

All in all, I don’t believe that Heron is likely to cause much of a dent in the uptake of data processing frameworks such as Apache Spark, Apache Flink, or Apache Beam. Their higher-level abstractions and APIs provide a much more developer-friendly experience than the lower-level Storm/Trident APIs. However, I believe the blend of JVM code with non-JVM modules for the critical paths is going to be a more popular approach going forward, and in this aspect, Heron shows us all the direction we’ll be traveling in the months and years to come.

Source: InfoWorld Big Data

Spark 2.0 prepares to catch fire

May 26, 2016 by Ian Pointer Posted in Industry Insights & News

Apache Spark 2.0 is almost upon us. If you have an account on Databricks’ cloud offering, you can get access to a technical preview today; for the rest of us, it may be a week or two, but by Spark Summit next month, I expect Apache Spark 2.0 to be out in the wild. What should you look forward to?

During the 1.x series, the development of Apache Spark was often at a breakneck pace, with all sorts of features (ML pipelines, Tungsten, the Catalyst query planner) added along the way during minor version bumps. Given this, and that Apache Spark follows semantic versioning rules, you can expect 2.0 to make breaking changes and add major new features.

Unify DataFrames and Datasets

One of the main reasons for the new version number won’t be noticed by many users: In Spark 1.6, DataFrames and Datasets are separate classes; in Spark 2.0, a DataFrame is simply an alias for a Dataset of type Row.

This may mean little to most of us, but such a big change in the class hierarchy means we’re looking at Spark 2.0 instead of Spark 1.7. You can now get compile-time type safety for DataFrames in Java and Scala applications and use both the typed methods (map, filter) and the untyped methods (select, groupBy) in both DataFrames and Datasets.

The all-new and improved SparkSession

A common question when working with Spark: “So, we have a SparkContext, a SQLContext, and a HiveContext. When should I use one and not the others?” Spark 2.0 introduces a new SparkSession object that reduces confusion and provides a consistent entry point for computation with Spark. Here’s what creating a SparkSession looks like:

val sparkSession = SparkSession.builder
  .master("local")
  .appName("my-spark-app")
  .config("spark.some.config.option", "config-value")
  .getOrCreate()

If you use the REPL, a SparkSession is automatically set up for you as spark. Want to read data into a DataFrame? Well, it should look somewhat familiar:

spark.read.json("JSON URL")

In another sign that operations using Spark’s initial abstraction of Resilient Distributed Dataset (RDD) are being de-emphasized, you’ll need to get at the underlying SparkContext using spark.sparkContext to create RDDs. Once again, RDDs aren’t going away, but the preferred DataFrame paradigm is becoming more and more prevalent, so if you haven’t worked with them yet, you will soon.

For those of you who have jumped into SparkSQL with both feet and discovered that sometimes you had to fight the query engine, Spark 2.0 has some extra goodies for you as well. There’s a new SQL parsing engine which includes support for subqueries and many SQL 2003 features (though it doesn’t claim full support yet), which should make porting legacy SQL applications to Spark a much more pleasant affair.

Structured Streaming

Structured Streaming is likely to be the new feature that everybody is excited about in the weeks and months to come. With good reason! I went into a lot of detail about what Structured Streaming is a few weeks ago, but as a quick recap, Apache Spark 2.0 brings a new paradigm for processing streaming data, moving away from the batched processing of RDDs to a concept of a DataFrame without bounds.

This will make certain types of streaming scenarios like change-data-capture and update-in-place much easier to implement — and allow windowing on time columns in the DataFrame itself instead of when new events enter the streaming pipeline. This has been a long-running thorn in Spark Streaming’s side, especially in comparison to competitors like Apache Flink and Apache Beam, so this addition alone will make many happy to upgrade to 2.0.

Performance improvements

Much effort has been spent on making Spark run faster and smarter in 2.0. The Tungsten engine has been augmented with bytecode optimizers that borrow techniques from compilers to reduce function calls and keep the CPU occupied efficiently during processing.

Parquet support has been improved, resulting in a 10-fold speed-up in some cases, and the use of Encoders over Java or Kryo serialization, first seen in Spark 1.6, continues to reduce memory usage and increase throughput in your cluster.

ML/GraphX

If you’re expecting big changes in the machine learning and graphing side of Spark, you might be a touch disappointed. The important change to Spark’s machine learning offerings is that development in the spark.mllib library is frozen. You should instead use the DataFrame-based API in spark.ml, which is where development will be concentrated going forward.

Spark 2.0 brings full support for model and ML pipeline persistence across all of its supported languages and makes more of the MLlib API available to Python and R for all of your data scientists who recoil in terror from Java or Scala.

As for GraphX, it seems to be a bit unloved in Spark 2.0. Instead, I’d urge you to keep an eye on GraphFrames. Currently a separate release from the main distribution, this builds a graph processing framework on top of DataFrames that is accessible from Java, Scala, Python, and R. I wouldn’t be surprised if this UC Berkeley/MIT/Databricks collaboration finds its way into Spark 3.0.

Say hello, wave good-bye

Of course, a new major version number is a great time to make breaking changes. Here are a couple of changes that may cause issues:

Dropping support for versions of Hadoop prior to 2.2
Removing the Bagel graphing library (the precursor to GraphX)

An important deprecation that you will almost certainly run across is the renaming of registerTempTable in SparkSQL. You should use createTempView instead, which makes it clearer that you’re not actually materializing any data with the API call. Expect a gaggle of deprecation notices in your logs from this change.

Should I rush to upgrade?

With promised large gains in performance and long-awaited new features in Spark Streaming, it’s tempting to hit Upgrade as soon as Apache Spark 2.0 becomes generally available in the next few weeks.

I would temper that impulse with a note or two of caution. A lot has changed under the covers for this release, so expect some bugs to crawl out as people start running their existing code on test clusters.

Nonetheless, with a brace of new features and performance improvements, it’s clear that Apache Spark 2.0 deserves its full version bump. Look for it in the next few weeks!

Source: InfoWorld Big Data
