Apache Eagle keeps an eye on big data usage

Apache Eagle, originally developed at eBay and then donated to the Apache Software Foundation, fills a big data security niche that remains thinly populated, if not bare: It sniffs out possible security and performance issues with big data frameworks.

To do this, Eagle uses other Apache open source components, such as Kafka, Spark, and Storm, to generate and analyze machine learning models from the behavioral data of big data clusters.

Looking in from the inside

Data for Eagle can come from activity logs for various data sources (HDFS, Hive, MapR FS, Cassandra, etc.) or from performance metrics harvested directly from frameworks like Spark. The data can then be piped by the Kafka streaming framework into a real-time detection system that’s built with Apache Storm, or into a model-training system built on Apache Spark. The former’s for generating alerts and reports based on existing policies; the latter is for creating machine learning models to drive new policies.
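
The alerting half of that split can be pictured as a small policy engine evaluating a stream of audit events. The sketch below is purely illustrative: the event fields and the policy shape are hypothetical, not Eagle’s actual metadata-driven policy format, which is evaluated inside Storm topologies.

```python
# Toy sketch of policy-based alerting over an audit-event stream.
# Event fields and the policy structure are hypothetical stand-ins.

def evaluate(events, policy):
    """Return an alert for each event matching the policy's predicate."""
    return [
        {"policy": policy["name"], "event": e}
        for e in events
        if policy["predicate"](e)
    ]

# Example policy: flag any delete operation on a sensitive HDFS path.
policy = {
    "name": "sensitive-delete",
    "predicate": lambda e: e["cmd"] == "delete"
                           and e["path"].startswith("/secure"),
}

events = [
    {"user": "alice", "cmd": "read",   "path": "/tmp/scratch"},
    {"user": "bob",   "cmd": "delete", "path": "/secure/payroll"},
]

alerts = evaluate(events, policy)  # only bob's delete should match
```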

This emphasis on real-time behavior tops the list of “key qualities” in the documentation for Eagle. It’s followed by “scalability,” “metadata driven” (meaning changes to policies are deployed automatically when their metadata is changed), and “extensibility.” This last means the data sources, alerting systems, and policy engines used by Eagle are supplied by plugins and aren’t limited to what’s in the box.

Because Eagle’s been put together from existing parts of the Hadoop world, it has two theoretical advantages. One, there’s less reinvention of the wheel. Two, those who already have experience with the pieces in question will have a leg up.

What are my people up to?

Aside from the above-mentioned use cases like analyzing job performance and monitoring for anomalous behavior, Eagle can also analyze user behaviors. This isn’t about, say, analyzing data from a web application to learn about the public users of that app, but rather the users of the big data framework itself — the folks building and managing the Hadoop or Spark back end. An example of how to run such analysis is included, and it could be deployed as-is or modified.

Eagle also allows application data access to be classified according to levels of sensitivity. Only HDFS, Hive, and HBase applications can make use of this feature right now, but its interaction with them provides a model for how other data sources could also be classified.

Let’s keep this under control

Because big data frameworks are fast-moving creations, it’s been tough to build reliable security around them. Eagle’s premise is that it can provide policy-based analysis and alerting as a possible complement to other projects like Apache Ranger. Ranger provides authentication and access control across Hadoop and its related technologies; Eagle gives you some idea of what people are doing once they’re allowed inside.

The biggest question hovering over Eagle’s future — yes, even this early on — is to what degree Hadoop vendors will elegantly roll it into their existing distributions, or favor their own security offerings instead. Data security and governance have long been among the missing pieces that commercial offerings could compete on.

Source: InfoWorld Big Data

Apache Beam unifies batch and streaming for big data

Apache Beam, a unified programming model for both batch and streaming data, has graduated from the Apache Incubator to become a top-level Apache project.

Aside from becoming another full-fledged widget in the ever-expanding Apache tool belt of big-data processing software, Beam addresses ease of use and dev-friendly abstraction, rather than just offering raw speed or a wider array of included processing algorithms.

Beam us up!

Beam provides a single programming model for creating batch and stream processing jobs (the name is a hybrid of “batch” and “stream”), and it offers a layer of abstraction for dispatching to various engines used to run said jobs. The project originated at Google, where it’s currently a service called GCD (Google Cloud Dataflow). Beam uses the same API as GCD, and it can use GCD as an execution engine, along with Apache Spark, Apache Flink (a stream processing engine with a highly memory-efficient design), and now Apache Apex (another stream engine for working closely with Hadoop deployments).

The Beam model involves five components: the pipeline (the pathway for data through the program); the “PCollections,” or data streams themselves; the transforms, for processing data; the sources and sinks, where data’s fetched and eventually sent; and the “runners,” or components that allow the whole thing to be executed on a given engine.
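
Those five pieces can be mimicked in a few lines of plain Python. This is a toy rendering of the model’s shape only — the class names and call conventions below are invented for illustration and do not match the real Beam SDK’s API.

```python
# Toy rendering of Beam's five components: pipeline, PCollection,
# transforms, source/sink, and a runner. Not the actual Beam API.

class PCollection:
    """A batch or stream of elements flowing through the pipeline."""
    def __init__(self, elements):
        self.elements = list(elements)

class Pipeline:
    """The pathway for data: an ordered chain of transforms."""
    def __init__(self):
        self.steps = []

    def apply(self, transform):
        self.steps.append(transform)
        return self

class DirectRunner:
    """A 'runner' that executes in-process; real runners dispatch the
    same pipeline to Spark, Flink, Apex, or Google Cloud Dataflow."""
    def run(self, pipeline, source):
        pcoll = PCollection(source)              # read from the source
        for transform in pipeline.steps:         # apply each transform
            pcoll = PCollection(transform(e) for e in pcoll.elements)
        return pcoll.elements                    # deliver to the "sink"

p = Pipeline().apply(lambda x: x * 2).apply(lambda x: x + 1)
result = DirectRunner().run(p, source=[1, 2, 3])  # [3, 5, 7]
```

The point of the runner layer is visible even in the toy: the pipeline definition never mentions the engine that executes it.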

Apache says it separated concerns in this fashion so that Beam can “easily and intuitively express data processing pipelines for everything from simple batch-based data ingestion to complex event-time-based stream processing.” This is in line with how tools like Apache Spark have been reworked to support stream and batch processing within the same product and with similar programming models. In theory, it’s one less concept for a prospective developer to wrap her head around, but that presumes Beam is used entirely in lieu of Spark or other frameworks, when it’s more likely that it’ll be used — at least at first — to augment them.

Hands off

One possible drawback to Beam’s approach is that while the layers of abstraction in the product make operations easier, they also put the developer at a distance from the underlying layers. A good case in point is Beam’s current level of integration with Apache Spark; the Spark runner doesn’t yet use Spark’s more recent DataFrames system, and thus may not take advantage of the optimizations those can provide. But this isn’t a conceptual flaw; it’s an implementation issue, and one that can be addressed in time.

The big payoff of using Beam, as noted by Ian Pointer in his discussion of Beam in early 2016, is that it makes migrations between processing systems less of a headache. Likewise, Apache says that Beam “cleanly [separates] the user’s processing logic from details of the underlying engine.”

Separation of concerns and ease of migration will be good to have if the ongoing rivalries between the various big data processing engines continue. Granted, Apache Spark has emerged as one of the undisputed champs of the field and become a de facto standard choice. But there’s always room for improvement, or an entirely new streaming or processing paradigm. Beam is less about offering a specific alternative than about providing developers and data-wranglers with more breadth of choice between them.

Source: InfoWorld Big Data

Google BigQuery provides insight into Stack Overflow discussion data

Software development discussion site Stack Overflow has started offering quarterly snapshots of its question-and-answer database through Google’s BigQuery.

Stack Exchange, parent company for Stack Overflow and its sister sites, has previously made its data available to researchers through its online data explorer. But now researchers with a Google Cloud Platform account can plug directly into the data set using Google’s data exploration tools, which have fewer limitations than Stack Overflow’s.

If you have a Google Cloud account, you can log in and begin exploring the data directly from a SQL-style web interface. Results from queries can be exported to CSV or JSON, saved to other tables in Google BigQuery, or exported to Google Sheets. BigQuery also comes with a REST API, so it can be used with third-party visualization tools or software stacks.
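
A query like the one below, run from the web interface or through the Python client, would surface the most common question tags. Treat the dataset path and the client usage as a sketch of how such a query is assembled, assuming the commonly published `bigquery-public-data.stackoverflow` layout; running it requires a Google Cloud project and the `google-cloud-bigquery` package.

```python
# Sketch of querying the Stack Overflow dataset via BigQuery.
# The dataset/table path is assumed from the public BigQuery
# Stack Overflow dataset; adjust if the layout differs.
QUERY = """
SELECT tags, COUNT(*) AS question_count
FROM `bigquery-public-data.stackoverflow.posts_questions`
GROUP BY tags
ORDER BY question_count DESC
LIMIT 10
"""

def top_tags(project_id):
    # Requires credentials and `pip install google-cloud-bigquery`;
    # the import is deferred so the query text is usable standalone.
    from google.cloud import bigquery
    client = bigquery.Client(project=project_id)
    return [dict(row) for row in client.query(QUERY).result()]
```

The same result set could be exported to CSV or Google Sheets directly from the web interface, as the article notes.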

Stack Overflow’s question-and-answer format is popular with developers seeking quick solutions to common problems. Though it has a reputation for being insular and unwelcoming, it’s widely trafficked, and many of its highest-voted answers are widely circulated as great explainers. For example, a popular question about why processing a sorted array is faster than working with an unsorted one not only gives a detailed technical answer, but also serves as a great explainer for the concept of branch prediction failure.

One possible application for Stack Overflow’s data, with or without BigQuery’s tool set, is sentiment analysis of topics and discussions taking place on Stack Overflow; in other words, getting broad hints about developers’ feelings about a technology.

If discussions about a language are paired with discussions about an IDE for that language, those threads could be parsed for details about what people are (or aren’t) doing most often with that pairing. Thus, you could figure out what developers might need but aren’t yet asking for.

Stack Overflow’s yearly surveys of its developers provide a similar snapshot of its audience’s mindsets: what languages are popular or how developers classify themselves. But such surveys are self-conscious and self-reporting, and they’re limited to the categories devised for them. Discussions on the site could provide more open-ended, direct, and detailed data about what developers like, hate, look for, and struggle with.

Note that this data set comes from Stack Overflow, and not from any of the other IT-related Stack Exchange sites, such as Server Fault (for IT admins) or Super User (for “computer enthusiasts and power users”). If these data sets go online through Google BigQuery as well, they could open up possibilities for even larger and more sophisticated analyses across multiple IT disciplines.

Source: InfoWorld Big Data

MariaDB crashes open source big data analytics competitors

MySQL variant MariaDB is aiming for the OLAP market with the public release of its latest feature, ColumnStore 1.0.

The move is part of MariaDB’s mission to broaden its reach and be a cheaper alternative to analytics databases like Teradata or Vertica. But the company faces stiff open source competition.

Doing more with less

Originally announced in April, ColumnStore isn’t a new project; it’s a port of an existing one, InfiniDB, that used the MySQL engine. After the company that produced InfiniDB went defunct in 2015, MariaDB took over the project, continued supporting its existing customer base, and realized that InfiniDB’s column-oriented technology could add OLAP capabilities to the traditionally OLTP-oriented MySQL. (Column-stored data allows for high-speed reading and searching of datasets.)

MariaDB believes there are multiple advantages to blending the two approaches. One is being able to perform queries that mix both columnar InfiniDB data and row-based MariaDB data — for instance, being able to create SQL JOINs across both kinds of data. Another is having a native SQL querying layer for an OLAP solution, which many OLAP products have been adding separately with widely varying efficacy.
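
The columnar side of that mix is easy to picture: an analytic aggregate over one column only has to touch that column’s values, not whole records. The comparison below is a schematic of the storage idea in Python, not a depiction of ColumnStore’s internals.

```python
# Row store: each record kept together. A column aggregate must
# visit every full record to pick out one field.
rows = [
    {"id": 1, "region": "EU", "sales": 100},
    {"id": 2, "region": "US", "sales": 250},
    {"id": 3, "region": "EU", "sales": 175},
]
row_total = sum(r["sales"] for r in rows)

# Column store: each column kept contiguously. The same aggregate
# scans a single array -- the access pattern that makes columnar
# layouts fast for analytic reads.
columns = {
    "id":     [1, 2, 3],
    "region": ["EU", "US", "EU"],
    "sales":  [100, 250, 175],
}
col_total = sum(columns["sales"])  # touches only the sales column
```

Both layouts yield the same answer; the difference is how much data each one must read to get there, which is the trade-off OLTP and OLAP engines make in opposite directions.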

But the biggest advantage is cost. MariaDB claims that ColumnStore “on average costs 90.3% less per terabyte compared to commercial data warehouses,” but offers little specific detail — what size of database, which specific commercial competitors, etc. — to back up the claim. A sample customer story involving the World Bank’s Institute for Health Metrics and Evaluation mostly cites earlier versions of MySQL (due to existing infrastructure) and the in-memory MemSQL database as the other choices considered, rather than any of the more commercial data-warehousing solutions.

Not the only game in town

Late 2015 saw a major open source competitor to conventional data warehouses or OLAP analytics solutions emerge: Greenplum Database, the data warehouse solution open sourced by Pivotal.

In a way, Greenplum vs. ColumnStore amounts to a clash between two long-standing open source database projects. With ColumnStore, it’s MySQL/MariaDB; with Greenplum, it’s PostgreSQL, since Greenplum is derived from that project.

That said, the two have evolved far past their roots; the competition between them is less about what underlying technology they use and more about how large an existing audience each of them is likely to capture.

Greenplum is likely to appeal to those who are already settled on Pivotal in some form or another. ColumnStore is for those still on MariaDB, but about to outgrow it because they’re tackling problems of far larger scope than MariaDB was set to handle. By offering ColumnStore, MariaDB aims to stave off migrations not just to competing products, but to new-breed warehousing services like Snowflake that are both increasingly cost-effective and ANSI SQL-compliant.

Source: InfoWorld Big Data

Move over Memcached and Redis, here comes Netflix's Hollow

After two years of internal use, Netflix is offering a new open source project as a powerful option to cache data sets that change constantly.

Hollow is a Java library and toolset aimed at in-memory caching of data sets up to several gigabytes in size. Netflix says Hollow’s purpose is threefold: It’s intended to be more efficient at storing data; it can provide tools to automatically generate APIs for convenient access to the data; and it can automatically analyze data use patterns to more efficiently synchronize with the back end.

Let’s keep this between us

Most of the scenarios for caching data on a system where it isn’t stored—a “consumer” system rather than a “producer” system—involve using a product like Memcached or Redis. Hollow is reminiscent of both products since it uses in-memory storage for fast access, but it isn’t an actual data store like Redis.

Unlike many other data caching systems, Hollow is intended to be coupled to a specific data set—a given schema with certain fields, typically a JSON stream. This requires some prep work, although Hollow provides some tools to partly automate the process. The reason for doing so: Hollow can store the data in-memory as fixed-length, strongly typed chunks that aren’t subject to Java’s garbage collection. As a result, they’re faster to access than conventional Java objects.
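
The payoff of fixed-length, strongly typed storage over boxed objects can be demonstrated in any managed language. Hollow itself is Java; the Python analogue below makes the same point by comparing a contiguous typed array against a list of boxed integers.

```python
# Illustration of fixed-width typed storage vs. boxed objects --
# the idea behind Hollow's GC-invisible chunks, shown in Python.
import sys
from array import array

values = list(range(10_000))

# Boxed: a list of Python int objects. Each element is a pointer
# plus a full object header, all tracked by the garbage collector.
boxed_size = sys.getsizeof(values) + sum(sys.getsizeof(v) for v in values)

# Fixed-length, typed: one contiguous buffer of 8-byte signed ints,
# analogous to Hollow's fixed-length strongly typed chunks.
packed = array("q", values)
packed_size = sys.getsizeof(packed)

# The packed form is several times smaller and cache-friendlier.
```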

Another purported boon with Hollow is that it provides a gamut of tooling for working with the data. Once you’ve defined a schema for the data, Hollow can automatically produce a Java API that can supply autocomplete data to an IDE. The data can also be tracked as it changes, so developers have access to point-in-time snapshots, differences between snapshots, and data rollbacks.

Faster all around

A lot of the advantages Netflix claims for Hollow involve basic operational efficiency—namely, faster startup time for servers and less memory churn. But Hollow’s data modeling and management tools are also meant to help with development, not simply speed production.

“Imagine being able to quickly shunt your entire production data set—current or from any point in the recent past—down to a local development workstation, load it, then exactly reproduce specific production scenarios,” Netflix says in its introductory blog post.

One caveat is that Hollow isn’t suited for data sets of all sizes—“KB, MB, and GB, but not TB,” is how the company puts it in its documentation. That said, Netflix also implies that Hollow reduces the amount of sprawl required by a cached data set. “With the right framework, and a little bit of data modeling, that [memory] threshold is likely much higher than you think,” Netflix writes.

Source: InfoWorld Big Data

Apache Mesos users focus on big data, containers

Mesosphere, the main commercial outfit behind the Apache Mesos datacenter and container orchestration project, has taken a good look at its user base and found that they gravitate toward a few fundamental use cases.

Survey data released recently by Mesosphere in the “Apache Mesos 2016 Survey Report” indicates that Mesos users focus on running containers at scale, using Mesos to deploy big data frameworks, and relying heavily on the core tool set that Mesos and DC/OS provide rather than using substitutes.

We got this contained

Created in 2009, Mesos was built to run workloads of all types and sizes across clusters of systems. DC/OS, released back in 2015 by Mesosphere, automates the deployment, provisioning, and scaling of applications with Mesos as the underlying technology. In that sense, it casts Mesos as a commodity, much as Docker made long-standing containerization techniques easy to work with.

The Mesosphere survey doesn’t cover a very large sample of users — fewer than 500, with 63 percent of those surveyed running Mesos for less than a year. Deployments are also modest — the overwhelming majority are fewer than 100 nodes — and by and large favor generic software/IT industry settings. Retail, e-commerce, telecom, and finance made up about 19 percent of the total combined.

Among the workloads deployed in Mesos, the largest slice (85 percent) covers containers and microservices, with 62 percent of all users deploying containers in production. Containers have long been a major part of Mesos’ and DC/OS’s focus, but Mesos sets itself apart from other container projects by providing a robust solution to container management, including native support for GPU-powered applications.

Do it yourself

The second biggest slice of the pie is data-centric applications. No prizes for guessing the top entry in that category: Apache Spark (43 percent of users), followed by other major big data infrastructure components like the Kafka messaging system (32 percent), the Elasticsearch search engine (26 percent), and the Cassandra NoSQL database (24 percent). Hadoop is in the mix as well, but only at 11 percent.

If there’s a takeaway to be found, it’s that specific solutions like Spark demonstrate more immediate payoffs than general solutions like Hadoop, especially when projects like DC/OS make them easier to deploy.

The survey also makes clear that Mesos users have historically put together projects themselves, but they like the idea of having the option to not have to. Of those who use Mesos, few currently do so with DC/OS’s automated deployment. Only 26 percent of those surveyed are running it in a production context, with another 12 percent “piloting for broader deployment.” That implies that many existing Mesos-powered deployments are hand-built.

However, newly minted Mesos users are going straight to DC/OS to get their Mesos fix. Eighty-seven percent of users who started with Mesos in the past six months did so through DC/OS. Thus, it’s safe to assume that as DC/OS becomes more widely used and Mesos continues to evolve (it recently hit a 1.0 release), DC/OS will become the predominant way to deploy both Mesos and all the apps that run with it.

It’s important to think about Mesos and DC/OS as complementary technologies to the rest of the container world, not total replacements for it. Kubernetes, for instance, can run in Mesos (and 8 percent of the respondents do use Kubernetes somewhere, according to the survey). Rather than eclipsing such arrangements outright, it’s more likely that DC/OS and Mesos will provide a more convenient option to build with them.

Source: InfoWorld Big Data

Redis module speeds Spark-powered machine learning

In-memory data store Redis recently acquired a module architecture to expand functionality. The latest module is a machine learning add-on that accelerates delivery of results from trained data rather than training itself.

Redis-ML, or the Redis Module for Machine Learning, comes courtesy of the commercial outfit that drives Redis development, Redis Labs. It speeds the execution of machine learning models while still allowing those models to be trained in familiar ways. Redis works as an in-memory cache backed by disk storage, and its creators claim machine learning models can be executed orders of magnitude more quickly with it.

The module works in conjunction with Apache Spark, another in-memory data-processing tool with machine learning components. Spark handles the data-gathering phase, and Redis plugs into the Spark cluster through the pre-existing Redis Spark-ML module. The model generated by Spark’s training is then saved to Redis, rather than to an Apache Parquet or HDFS data store. To execute the models, you run the queries on the Redis-ML module, not Spark itself.
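
The train-in-Spark, serve-from-Redis split can be sketched with a plain dict standing in for the Redis instance. This is purely a stand-in for the workflow’s shape: the real module speaks Redis commands and stores model structures like tree ensembles natively, not Python dicts.

```python
# Stand-in sketch of the flow: train elsewhere, persist the model's
# parameters to a key-value store, serve predictions from the store.
store = {}  # pretend this dict is a Redis instance

def save_model(key, weights, bias):
    """What the training side would do after Spark finishes:
    persist the learned parameters under a key."""
    store[key] = {"weights": weights, "bias": bias}

def predict(key, features):
    """What the serving layer does: load parameters and score,
    with no Spark cluster in the request path."""
    m = store[key]
    return sum(w * x for w, x in zip(m["weights"], features)) + m["bias"]

# Hypothetical trained linear model: weights and bias are made up.
save_model("ctr-model", weights=[0.5, -0.25], bias=1.0)
score = predict("ctr-model", [4.0, 2.0])  # 0.5*4 - 0.25*2 + 1 = 2.5
```

Keeping scoring out of the Spark request path is where the claimed latency improvement comes from: the serving side reads parameters from memory and computes, nothing more.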

In the big picture, Redis-ML offers speed: faster responses to individual queries, smaller penalties for large numbers of users making requests, and the ability to provide high availability of the results via a scale-out Redis setup. Redis Labs claims the prediction process shows “5x to 10x latency improvement over the standard Spark solution in real-time classifications.”

Another boon is specifically for developers, as Redis-ML interoperates with Scala, JavaScript (via Node.js), Python, and the .Net languages. Models “are no longer restricted to the language they were developed in,” states Redis Labs, but “can be accessed by applications written in different languages concurrently using the simple [Redis-ML] API.” Redis Labs also claims the resulting trained model is easier to deploy, since it can be accessed through said APIs without custom code or infrastructure.

Accelerating Spark with other technologies isn’t a new idea. Previously, the idea was to speed up the storage back ends that Spark talks to. In fact, Redis’ engineers herald it as one such solution. Another project, Apache Arrow, speeds Spark execution (and other big data projects) by transforming data into a columnar format that can be processed more efficiently.

Redis Labs is pushing Redis as a broad solution to these problems, since its architecture (what its creators call a “structure store”) permits more free-form storage than competing database solutions. Redis VP of Product Management Cihan Biyikoglu noted in a phone interview that other databases attempt to adapt data types to the problems at hand; Redis, by contrast, instead of “shackling [you] to one data model, type, or representation,” allows “an abstraction that can house any type of data.”

If Redis Labs has a long-term plan, it’s to inch Redis toward becoming an all-in-one solution for machine learning — to provide a data-gathering and data-querying mechanism along with the machine learning libraries under one roof. To wit: Another Redis module, for Google’s TensorFlow framework, not only allows Redis to serve as backing for TensorFlow, but allows training TensorFlow models directly inside Redis.

Source: InfoWorld Big Data

Spark picks up machine learning, GPU acceleration

Databricks, corporate provider of support and development for the Apache Spark in-memory big data project, has spiced up its cloud-based implementation of Apache Spark with two additions that top IT’s current hot list.

The new features — GPU acceleration and integration with numerous deep learning libraries — can in theory be implemented in any local Apache Spark installation. But Databricks says its versions are tuned to avoid the resource contentions that complicate the use of such features.

Apache Spark isn’t configured out of the box to provide GPU acceleration, and to set up a system to support it, users must cobble together several pieces. To that end, Databricks offers to handle all the heavy lifting.

Databricks also claims that Spark’s behaviors are tuned to get the most out of a GPU cluster by reducing the number of contentions across nodes. This seems similar to the strategy used by MIT’s Milk library to accelerate parallel processing applications, wherein operations involving memory are batched to take maximum advantage of a system’s cache line. Likewise, Databricks’ setup tries to keep GPU operations from interrupting each other.

Another time-saving measure is adding direct access to popular machine learning libraries that can use Spark as a data source. Among them is Databricks’ TensorFrames, which allows the TensorFlow library to work with Spark and is GPU-enabled.

Databricks has tweaked its infrastructure to get the most out of Spark. It created a free tier of service to attract customers still wary of deep commitment, providing them with a subset of the conveniences available in the full-blown product. InfoWorld’s Martin Heller checked out the service earlier this year and liked what he saw, precisely because it was free to jump into and easy to get started.

But competition will be fierce, especially since Databricks faces brand-name juggernauts like Microsoft (via Azure Machine Learning), IBM, and Amazon. Thus, it has to find ways to both keep and expand an audience for a service as specific and focused as its own. The plan appears to involve not only adding features like machine learning and GPU acceleration to the mix, but ensuring they bring convenience, not complexity.

Source: InfoWorld Big Data

The wait for TensorFlow on Windows is almost over

When will it be possible to run Google’s TensorFlow deep learning system on Windows with full GPU support? The short answer is “soon.”

The real holdup, though, hasn’t even been TensorFlow. It’s been the lack of a working Windows version of Bazel, Google’s in-house tool that delivers TensorFlow builds.

TensorFlow on Windows seems like a no-brainer. Support for GPU-accelerated applications on Windows is highly robust, and Windows is about as popular a platform as you could ask for. To that end, a GitHub issue has been open with TensorFlow for providing Windows support since November of last year.

But the lack of a Windows version of Bazel has kept TensorFlow off Windows — until now. A working edition of Bazel has finally shipped for Windows, and it’s even available to developers through the Chocolatey package management system.

The other delay is adding GPU support to TensorFlow on Windows. While TensorFlow can fall back to CPUs across multiple nodes as a compatibility measure, it’s best run with full GPU support. After some work, said support for Windows is now on the verge of being merged into the project’s mainline.

An earlier fork of the project, produced some two months ago, provided a Windows build for TensorFlow via CMake and Visual Studio 2015 rather than Bazel. But it lacked support for GPU acceleration, and the cost of not using Bazel for the build process might well have been unsupportable over time.

Getting TensorFlow on Windows, then, is a double milestone. Aside from putting a powerful and useful deep learning tool into the hands of a much broader audience of users, the process of bringing it to that audience means future Google projects built with Bazel will have native Windows versions sooner, too.

Source: InfoWorld Big Data

Snowflake now offers data warehousing to the masses

Snowflake, the cloud-based data warehouse solution co-founded by Microsoft alumnus Bob Muglia, is lowering storage prices and adding a self-service option, meaning prospective customers can open an account with nothing more than a credit card.

These changes also raise an intriguing question: How long can a service like Snowflake expect to reside on Amazon, which itself offers services that are more or less in direct competition — and where the raw cost of storage undercuts Snowflake’s own pricing for the same?

Open to the public

The self-service option, called Snowflake On Demand, is a change from Snowflake’s original sales model. Rather than calling a sales representative to set up an account, Snowflake users can now provision services themselves with no more effort than would be needed to spin up an AWS EC2 instance.

In a phone interview, Muglia said the reason for transitioning to this model only now was more technical than anything else. Before self-service could be offered, Snowflake had to put protections into place to ensure that both the service itself and its customers could be shielded from everything from malice (denial-of-service attacks) to incompetence (honest customers submitting massively malformed queries).

“We wanted to make sure we had appropriately protected the system,” Muglia said, “before we opened it up to anyone, anywhere.”

This effort was further complicated by Snowflake’s relative lack of hard usage limits, which Muglia characterized as being one of its major standout features. “There is no limit to the number of tables you can create,” Muglia said, but he further pointed out that Snowflake has to strike a balance between what it can offer any one customer and protecting the integrity of the service as a whole.

“We get some crazy SQL queries coming in our direction,” Muglia said, “and regardless of what comes in, we need to continue to perform appropriately for that customer as well as other customers. We see SQL queries that are a megabyte in size — the query statements [themselves] are a megabyte in size.” (Many such queries are poorly formed, auto-generated SQL, Muglia claimed.)

Fewer costs, more competition

The other major change is a reduction in storage pricing for the service — $30/TB/month for capacity storage, $50/TB/month for on-demand storage, and compressed storage at $10/TB/month.
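
At those rates, storage cost is straightforward arithmetic. The tier names below simply mirror the article’s quoted figures:

```python
# Monthly storage cost at the article's quoted rates, in $/TB/month.
RATES = {"capacity": 30, "on_demand": 50, "compressed": 10}

def monthly_storage_cost(terabytes, tier):
    """Cost in dollars for the given volume at the given tier."""
    return terabytes * RATES[tier]

# e.g. 20 TB of capacity storage:
cost = monthly_storage_cost(20, "capacity")  # 20 * 30 = 600 dollars
```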

It’s enough of a reduction in price that Snowflake will be unable to rely on storage costs as a revenue source, since those prices barely pay for the use of Amazon’s services as a storage provider. But Muglia is confident Snowflake is profitable enough overall that such a move won’t impact the company’s bottom line.

“We did the data modeling on this,” said Muglia, “and our margins were always lower on storage than on compute running queries.”

According to the studies Snowflake performed, “when customers put more data into Snowflake, they run more queries…. In almost every scenario you can imagine, they were very much revenue-positive and gross-margin neutral, because people run more queries.”

The long-term implications for Snowflake continuing to reside on Amazon aren’t clear yet, especially since Amazon might well be able to undercut Snowflake by directly offering competitive services.

Muglia, though, is confident that Snowflake’s offering is singular enough to stave off competition for a good long time, and is ready to change things up if need be. “We always look into the possibility of moving to other cloud infrastructures,” Muglia said, “although we don’t have plans to do it right now.”

He also noted that Snowflake competes with Amazon and Redshift right now, but “we have a very different shape of product relative to Redshift…. Snowflake is storing multiple petabytes of data and is able to run hundreds of simultaneous concurrent queries. Redshift can’t do that; no other product can do that. It’s that differentiation that allows [us] to effectively compete with Amazon, and for that matter Google and Microsoft and Oracle and Teradata.”

Source: InfoWorld Big Data