5 Python libraries to lighten your machine learning load

Machine learning is exciting, but the work is complex and difficult. It typically involves a lot of manual lifting — assembling workflows and pipelines, setting up data sources, and shunting back and forth between on-prem and cloud-deployed resources.

The more tools you have in your belt to ease that job, the better. Thankfully, Python is a giant tool belt of a language that’s widely used in big data and machine learning. Here are five Python libraries that help relieve the heavy lifting for those trades.

PyWren

A simple package with a powerful premise, PyWren lets you run Python-based scientific computing workloads as multiple instances of AWS Lambda functions. A profile of the project at The New Stack describes PyWren using AWS Lambda as a giant parallel processing system, tackling projects that can be sliced and diced into little tasks that don’t need a lot of memory or storage to run.

One downside is that Lambda functions can’t run for more than 300 seconds. But if you have a job that takes only a few minutes to complete and needs to run thousands of times across a data set, PyWren may be a good way to parallelize that work in the cloud at a scale unavailable on your own hardware.
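
For a sense of the programming model, here is a minimal sketch modeled on PyWren’s documented map-style usage. It assumes PyWren has already been configured against an AWS account, and exact API details may vary between releases.

    import pywren

    def simulate(seed):
        # A small, CPU-bound task; each call runs as its own Lambda invocation.
        import random
        random.seed(seed)
        return sum(random.random() for _ in range(1000000))

    pwex = pywren.default_executor()            # picks up AWS settings from PyWren's config
    futures = pwex.map(simulate, range(1000))   # fan the work out across Lambda
    results = [f.result() for f in futures]     # block until every invocation returns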

Tfdeploy

Google’s TensorFlow framework is taking off big-time now that it’s at a full 1.0 release. One common question about it: How can I make use of the models I train in TensorFlow without using TensorFlow itself?

Tfdeploy is a partial answer to that question. It exports a trained TensorFlow model to “a simple NumPy-based callable,” meaning the model can be used in Python with Tfdeploy and the NumPy math-and-stats library as the only dependencies. Most of the operations you can perform in TensorFlow can also be performed in Tfdeploy, and you can extend the behaviors of the library by way of standard Python metaphors (such as overloading a class).
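
The export/restore round trip looks roughly like the sketch below, which follows the pattern in the project’s documentation; the tensor names “input” and “output” are simply whatever you named those tensors when building the graph, and details may differ between versions.

    import numpy as np
    import tensorflow as tf
    import tfdeploy as td

    # --- at training time, with TensorFlow installed ---
    x = tf.placeholder("float", shape=[None, 784], name="input")
    W = tf.Variable(tf.random_normal([784, 10]))
    b = tf.Variable(tf.zeros([10]))
    y = tf.nn.softmax(tf.matmul(x, W) + b, name="output")

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())
    # ... training would happen here ...

    model = td.Model()
    model.add(y, sess)        # y and everything it depends on are captured
    model.save("model.pkl")

    # --- at serving time, with only NumPy and tfdeploy installed ---
    model = td.Model("model.pkl")
    inp, outp = model.get("input", "output")
    print(outp.eval({inp: np.random.rand(10, 784)}).shape)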

Now the bad news: Tfdeploy doesn’t support GPU acceleration, if only because NumPy doesn’t do that. Tfdeploy’s creator suggests using the gNumPy project as a possible replacement.

Luigi

Writing batch jobs is generally only one part of processing heaps of data; you also have to string all the jobs together into something resembling a workflow or a pipeline. Luigi, created by Spotify and named for the other plucky plumber made famous by Nintendo, was built to “address all the plumbing typically associated with long-running batch processes.”

With Luigi, a developer can take several different unrelated data processing tasks — “a Hive query, a Hadoop job in Java, a Spark job in Scala, dumping a table from a database” — and create a workflow that runs them end to end. The entire description of a job and its dependencies is created as Python modules, not as XML config files or another data format, so it can be integrated into other Python-centric projects.
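
Here’s a minimal, hypothetical two-task sketch of that idea: one task produces a file, a second task declares it as a dependency, and Luigi works out the ordering and skips anything whose output already exists.

    import luigi

    class ExtractLogs(luigi.Task):
        date = luigi.DateParameter()

        def output(self):
            return luigi.LocalTarget("logs_%s.tsv" % self.date)

        def run(self):
            with self.output().open("w") as out:
                out.write("raw log lines would be dumped here\n")

    class DailyReport(luigi.Task):
        date = luigi.DateParameter()

        def requires(self):
            return ExtractLogs(date=self.date)   # Luigi runs this first if its output is missing

        def output(self):
            return luigi.LocalTarget("report_%s.txt" % self.date)

        def run(self):
            with self.input().open() as src, self.output().open("w") as out:
                out.write("lines: %d\n" % sum(1 for _ in src))

    if __name__ == "__main__":
        luigi.run()   # e.g. python tasks.py DailyReport --date 2017-02-15 --local-scheduler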

Kubelib

If you’re adopting Kubernetes as an orchestration system for machine learning jobs, the last thing you want is for the mere act of using Kubernetes to create more problems than it solves. Kubelib provides a set of Pythonic interfaces to Kubernetes, originally to aid with Jenkins scripting. But it can be used without Jenkins as well, and it can do everything exposed through the kubectl CLI or the Kubernetes API.
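
Kubelib’s own API isn’t reproduced here; as a rough illustration of what Pythonic access to Kubernetes looks like in practice, the sketch below uses the official kubernetes Python client to do roughly what "kubectl get pods" does. Treat it as a flavor of the idea rather than as Kubelib’s interface.

    # Illustration only: the official `kubernetes` Python client, not Kubelib itself.
    from kubernetes import client, config

    config.load_kube_config()          # reads ~/.kube/config, just as kubectl does
    v1 = client.CoreV1Api()

    for pod in v1.list_namespaced_pod(namespace="default").items:
        print(pod.metadata.name, pod.status.phase)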

PyTorch

Let’s not forget this recent, high-profile addition to the Python world, an implementation of the Torch machine learning framework. PyTorch doesn’t just port Torch to Python; it adds many other conveniences, such as GPU acceleration and a library that allows multiprocessing with shared memory (for partitioning jobs across multiple cores). Best of all, it can provide GPU-powered replacements for some of the unaccelerated functions in NumPy.
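
As a quick taste of that last point, here is a minimal sketch; it assumes a working PyTorch install and simply falls back to the CPU when no CUDA device is present.

    import torch

    # NumPy-style math that can move to the GPU.
    a = torch.randn(1000, 1000)
    b = torch.randn(1000, 1000)

    if torch.cuda.is_available():      # use the GPU when one is present
        a, b = a.cuda(), b.cuda()

    c = torch.mm(a, b)                 # matrix multiply, CUDA-accelerated if the tensors live on the GPU
    print(c.cpu().numpy().shape)       # hop back into NumPy when needed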

Source: InfoWorld Big Data

IDG Contributor Network: Bringing embedded analytics into the 21st century

Software development has changed pretty radically over the last decade. Waterfall is out, Agile is in. Slow release cycles are out, continuous deployment is in. Developers avoid scaling up and scale out instead. Proprietary integration protocols have (mostly) given way to open standards.

At the same time, exposing analytics to customers in your application has gone from a rare, premium offering to a requirement. Static reports and SOAP APIs that deliver XML files just don’t cut it anymore.

And yet, the way that most embedded analytics systems are designed is basically the same as it was 10 years ago: Inflexible, hard to scale, lacking modern version control, and reliant on specialized, expensive hardware.

Build or Buy?

It’s no wonder that today’s developers often choose to build embedded analytics systems in-house. Developers love a good challenge, so when faced with the choice between an outdated, off-the-shelf solution and building it themselves, they’re going to get to work.

But expectations for analytics have increased, and so even building out the basic functionality that customers demand can sidetrack engineers (whose time isn’t cheap) for months. This is to say nothing of the engineer-hours required to maintain a homegrown system down the line. I simply don’t believe that building it yourself is the right solution unless analytics is your core product.

So what do you do?

Honestly, I’m not sure. Given the market opportunity, I think it’s inevitable that more and more vendors will move into the space and offer modern solutions. And so I thought I’d humbly lay out 10 questions embedded analytics buyers should ask about the solutions they’re evaluating.

  1. How does the solution scale as data volumes grow? Does it fall down or require summarization when dealing with big data?
  2. How does the tool scale to large customer bases? Is supporting 1,000 customers different than supporting 10?
  3. Do I need to maintain specialized ETLs and data ingestion flows for each customer? What if I want to change the ETL behavior? How hard is that?
  4. What’s the most granular level that customers can drill to?
  5. Do I have to pay to keep duplicated data in a proprietary analytics engine? If so, how much latency does that introduce? How do things stay in sync?
  6. Can I make changes to the content and data model myself or is the system a black box where every change requires support or paid professional services?
  7. Does it use modern, open standards and frameworks like HTML5, JavaScript, iframes, HTTPS, and RESTful APIs?
  8. Does the platform offer version control? If so, which parts of the platform (data, data model, content, etc.) are covered by version control?
  9. How customizable is the front-end? Can fonts, color palettes, language, timezones, logos, and caching behavior all be changed? Can customization be done on a customer-by-customer basis or is it one template for all customers?
  10. How much training is required for admins and developers? And how intuitive is the end-user interface?

No vendor that I know of has the “right” answer to all these questions (yet), but they should be taking these issues seriously and working toward these goals.

If they’re not, you can bet your engineers are going to start talking about how they could build something better in a week. HINT: They actually can’t, but good luck winning that fight 😉

Source: InfoWorld Big Data

6 reasons stores can't give you real-time offers (yet)

Like most hardcore people, in the car I roll with my windows down and my radio cranked up to 11—tuned to 91.5, my local NPR station, where Terry Gross recently interviewed Joseph Turow, author of “The Aisles Have Eyes.” Turow reports that retailers are using data gathered from apps on your phone and other information to change prices on the fly.

Having worked in this field for a while, I can tell you that, yes, they’re gathering any data they can get. But the kind of direct manipulation Turow claims, where the price changes on the shelf before your eyes, isn’t yet happening on a wide scale. (Full disclosure: I’m employed by LucidWorks, which offers personalized/targeted search and machine-learning-assisted search as features in products we sell.)

Why not? I can think of a number of reasons.

1. Technology changes behavior slowly

Printers used to be a big deal. There were font and typesetting wars (TrueType, PostScript, and so on), and people printed out pages simply to read comfortably. After all, screen resolutions were low and interfaces were clunky; scanners were cumbersome and email was unreliable. Yet even after these obstacles were overcome, the old ways stuck around. There are still paper books (I mailed all of mine to people in prison), and the government still makes me print things and even get them notarized sometimes.

Obviously, change happens: I now tend to use Uber even if a cab is waiting, and I don’t bother to check the price difference, regardless of surge status. Also, today I buy all my jeans from Amazon—yet still use plastic cards for payment. The clickstream data collected on me is mainly used for email marketing and ad targeting, as opposed to real-time sales targeting.

2. Only some people can be influenced

For years I put zero thought into my hand soap purchase because my partner bought it. Then I split with my partner and became a soap buyer again. I did some research and found a soap that didn’t smell bad, didn’t have too many harsh chemicals, and paid lip service to the environment. Now, to get me to even try something else you’d probably have to give it to me for free. I’m probably not somebody a soap company wants to bother with. I’m not easily influenced.

I’m more easily influenced in other areas—such as cycling and fitness stuff—but those tend to be more expensive, occasional purchases. To reach me, the technique needs to be different from pure retailing.

3. High cost for marginal benefit

Much personalization technology, such as the analytics behind real-time discounts, is still expensive to deploy. Basic techniques such as using my interests or previously clicked links to improve the likelihood of my making a purchase are probably “effective enough” for most online retailers.

As for brick and mortar, I have too many apps on my phone already, so getting me to download yours will require a heavy incentive. I also tend to buy only one item because I forgot to buy it online—then I leave—so the cost to overcome my behavioral inertia and influence me will be high.

4. Pay to play

Business interests limit the effectiveness of analytics in influencing consumers, mainly in the form of slotting fees charged to suppliers who want preferential product placement in the aisles.

Meanwhile, Target makes money no matter what soap I buy there. Unless incentivized, it’s not going to care which brand I choose. Effective targeting may require external data (like my past credit card purchases at other retailers) and getting that data may be expensive. The marketplace for data beyond credit card purchases is still relatively immature and fragmented.

5. Personalization is difficult at scale

For effective personalization, you must collect or buy data on everything I do everywhere and store it. You need to run algorithms against that data to model my behavior. You need to identify different means of influencing me. Some of this is best done for a large group (as in the case of product placement), but doing it for individuals requires lots of experimentation and tuning—and it needs to be done fast.

Plus, it needs to be done right. If you bug me too much, I’m totally disabling or uninstalling your app (or other means of contacting me). You need to make our relationship bidirectional. See yourself as my concierge, someone who finds me what I need and anticipates those needs rather than someone trying to sell me something. That gets you better data and stops you from getting on my nerves. (For the last time, Amazon, I’ve already purchased an Instant Pot, and it will be years before I buy another pressure cooker. Stop following me around the internet with that trash!)

6. Machine learning needs to mature

Machine learning is merely math; much of it isn’t even new. But applying it to large amounts of behavioral data—where you have to decide which algorithm to use, which optimizations to apply to that algorithm, and which behavioral data you need in order to apply it—is pretty new. Most retailers are used to buying out-of-the-box solutions. Beyond (ahem) search, some of these barely exist yet, so you’re stuck rolling your own. Hiring the right expertise is expensive and fraught with error.

Retail reality

To influence a specific, individual consumer who walks into a physical store, the cost is high and the effectiveness is low. That’s why most brick-and-mortar businesses tend to use advanced data—such as how much time people spend in which part of the store and what products influenced that decision—at a more statistical level to make systemic changes and affect ad and product placement.

Online retailers have a greater opportunity to influence people at a personal level, but most of that opportunity is in ad placement, feature improvements, and (ahem) search optimization. As for physical stores, eventually, you may well see a price drop before your eyes as some massive cloud determines the tipping point for you to buy on impulse. But don’t expect it to happen anytime soon.

Source: InfoWorld Big Data

IBM sets up a machine learning pipeline for z/OS

If you’re intrigued by IBM’s Watson AI as a service, but reluctant to trust IBM with your data, Big Blue has a compromise. It’s packaging Watson’s core machine learning technology as an end-to-end solution available behind your firewall.

Now the bad news: It’ll only be available to z System / z/OS mainframe users … for now.

From start to finish

IBM Machine Learning for z/OS isn’t a single machine learning framework. It’s a collection of popular frameworks — in particular Apache SparkML, TensorFlow, and H2O — packaged with bindings to common languages used in the trade (Python, Java, Scala), and with support for “any transactional data type.” IBM is pushing it as a pipeline for building, managing, and running machine learning models through visual tools for each step of the process and RESTful APIs for deployment and management.

There’s a real need for this kind of convenience. Even as the number of frameworks for machine learning mushrooms, developers still have to perform a lot of heavy labor to create end-to-end production pipelines for training and working with models. This is why Baidu outfitted its PaddlePaddle deep learning framework with support for Kubernetes; in time the arrangement could serve as the underpinning for a complete solution that would cover every phase of machine learning.

Other components in IBM Machine Learning fit into this overall picture. The Cognitive Automation for Data Scientists element “assists data scientists in choosing the right algorithm for the data by scoring their data against the available algorithms and providing the best match for their needs,” checking metrics like performance and fitness to task for a given algorithm and workload.

Another function “schedule[s] continuous re-evaluations on new data to monitor model accuracy over time and be alerted when performance deteriorates.” Models trained on data, rather than algorithms themselves, are truly crucial in any machine learning deployment, so IBM’s wise to provide such utilities.

z/OS for starters; Watson it ain’t

The decision to limit the offering to z System machines for now makes the most sense as part of a general IBM strategy where machine learning advances are paired directly with branded hardware offerings. IBM’s PowerAI system also pairs custom IBM hardware — in this case, the Power8 processor — with commodity Nvidia GPUs to train models at high speed. In theory, PowerAI devices could run side by side with a mix of other, more mainstream hardware as part of an overall machine learning hardware array.

The z/OS incarnation of IBM Machine Learning is aimed at an even higher and narrower market: existing z/OS customers with tons of on-prem data. Rather than ask those (paying) customers to connect to something outside of their firewalls, IBM offers them first crack at tooling to help them get more from the data. The wording of IBM’s announcement — “initially make [IBM Machine Learning] available [on z/OS]” — implies that other targets are possible later on.

It’s also premature to read this as “IBM Watson behind the firewall,” since Watson’s appeal isn’t the algorithms themselves or the workflow IBM’s put together for them, but rather the volumes of pretrained data assembled by IBM, packaged into models and deployed through APIs. Those will remain exactly where IBM can monetize them best: behind its own firewall of IBM Watson as a service.

Source: InfoWorld Big Data

HPE acquires security startup Niara to boost its ClearPass portfolio

Hewlett Packard Enterprise has acquired Niara, a startup that uses machine learning and big data analytics on enterprise packet streams and log streams to detect and protect customers from advanced cyberattacks that have penetrated perimeter defenses.

The financial terms of the deal were not disclosed.

Operating in the User and Entity Behavior Analytics (UEBA) market, Niara’s technology starts by automatically establishing baseline characteristics for all users and devices across the enterprise and then looking for anomalous, inconsistent activities that may indicate a security threat, Keerti Melkote, senior vice president and general manager of HPE Aruba and cofounder of Aruba Networks, wrote in a blog post on Wednesday.

The time taken to investigate individual security incidents has been reduced from up to 25 hours using manual processes to less than a minute by using machine learning, Melkote added. 

Hewlett-Packard acquired wireless networking company Aruba Networks in May 2015, ahead of its corporate split into HPE, an enterprise-focused business, and HP, which focuses on PCs and printers.

The strategy now is to integrate Niara’s behavioral analytics technology with Aruba’s ClearPass Policy Manager, a role- and device-based network access control platform, so as to offer customers advanced threat detection and prevention for network security across wired and wireless environments and internet of things (IoT) devices, Melkote wrote.

For Niara CEO Sriram Ramachandran, vice president of engineering Prasad Palkar, and several other engineers, the deal is a homecoming: they were part of the team that developed the core technologies in the ArubaOS operating system.

Niara technology addresses the need to monitor a device after it is on the internal network, following authentication by a network access control platform like ClearPass. Niara claims that it detects compromised users, systems or devices by aggregating and putting into context even subtle changes in typical IT access and usage.

Most networks today allow traffic to flow freely between source and destination once devices are on the network, with internal controls such as access control lists used to protect only some types of traffic, Melkote wrote.

“More importantly, none of this traffic is analyzed to detect advanced attacks that have penetrated perimeter security systems and actively seek out weaknesses to exploit on the interior network,” she added.

Source: InfoWorld Big Data

New big data tools for machine learning spring from home of Spark and Mesos

If the University of California, Berkeley’s AMPLab doesn’t ring bells, perhaps some of its projects will: Spark and Mesos.

AMPLab was planned all along as a five-year computer science research initiative, and it closed down as of last November after running its course. But a new lab is opening in its wake: RISELab, another five-year project at UC Berkeley with major financial backing and the stated goal of “focus[ing] intensely for five years on systems that provide Real-time Intelligence with Secure Execution [RISE].”

AMPLab was created with “a vision of understanding how machines and people could come together to process or to address problems in data — to use data to train rich models, to clean data, and to scale these things,” said Joseph E. Gonzalez, Assistant Professor in the Department of Electrical Engineering and Computer Science at UC Berkeley.

RISELab’s web page describes the group’s mission as “a proactive step to move beyond big data analytics into a more immersive world,” where “sensors are everywhere, AI is real, and the world is programmable.” One example cited: Managing the data infrastructure around “small, autonomous aerial vehicles,” whether unmanned drones or flying cars, where the data has to be processed securely at high speed.

Other big challenges Gonzalez singled out include security, but not the conventional focus on access controls. Rather, it involves concepts like “homomorphic” encryption, where encrypted data can be operated on without first having to decrypt it. “How can we make predictions on data in the cloud,” said Gonzalez, “without the cloud understanding what it is it’s making predictions about?”

Though the lab is in its early days, a few projects have already started to emerge:

Clipper

Machine learning involves two basic kinds of work: Creating models from which predictions can be derived and serving up those predictions from the models. Clipper focuses on the second task and is described as a “general-purpose low-latency prediction serving system” that takes predictions from machine learning frameworks and serves them up with minimal latency.

Clipper has three aims that ought to draw the attention of anyone working with machine learning: One, it accelerates serving up predictions from a trained model. Two, it provides an abstraction layer across multiple machine learning frameworks, so a developer only has to program to a single API. Three, Clipper’s design makes it possible to respond dynamically to how individual models respond to requests — for instance, to allow a given model that works better for a particular class of problem to receive priority. Right now there’s no explicit mechanism for this, but it is a future possibility.
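
Clients typically talk to a prediction-serving system like Clipper over REST, so an application doesn’t need to know which framework produced the model behind an endpoint. The sketch below is a hypothetical client call: the host, port, application name, and payload shape are assumptions based on Clipper’s early documentation, so check the project’s docs for the exact contract of your deployment.

    import json
    import requests

    # Hypothetical Clipper query: POST one feature vector to <host>:<port>/<app>/predict.
    url = "http://localhost:1337/digits/predict"
    payload = {"input": [0.0] * 784}
    resp = requests.post(url, data=json.dumps(payload),
                         headers={"Content-Type": "application/json"})
    print(resp.json())                 # served by whichever model currently backs the "digits" app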

Opaque

It seems fitting that a RISELab project would complement work done by AMPLab, and one does: Opaque works with Apache Spark SQL to enable “very strong security for DataFrames.” It uses Intel SGX processor extensions to allow DataFrames to be marked as encrypted and have all their operations performed within an “SGX enclave,” where data is encrypted in place using the AES algorithm and is only visible to the application using it via hardware-level protection.

Gonzalez says this delivers the benefits of homomorphic encryption without the performance cost. The performance hit for using SGX is around 50 percent, but the fastest current implementations of homomorphic algorithms run 20,000 times slower. On the other hand, SGX-enabled processors are not yet offered in the cloud, although Gonzalez said this is slated to happen “in the near future.” The biggest stumbling block, though, may be the implementation, since in order for this to work, “you have to trust Intel,” as Gonzalez pointed out.

Ground

Ground is a context management system for data lakes. It provides a mechanism, implemented as a RESTful service in Java, that “enables users to reason about what data they have, where that data is flowing to and from, who is using the data, when the data changed, and why and how the data is changing.”

Gonzalez noted that data aggregation has moved away from strict, data-warehouse-style governance and toward “very open and flexible data lakes,” but that makes it “hard to track how the data came to be.” In some ways, he pointed out, knowing who changed a given set of data and how it was changed can be more important than the data itself. Ground provides a common API and meta model for tracking such information, and it works with many data repositories. (The Git version control system, for instance, is one of the supported data formats in the early alpha version of the project.)

Gonzalez admitted that defining RISELab’s goals can be tricky, but he noted that “at its core is this transition from how we build advanced analytics models, how we analyze data, to how we use that insight to make decisions — connecting the products of Spark to the world, the products of large-scale analytics.”

Source: InfoWorld Big Data

Review: The best frameworks for machine learning and deep learning

Over the past year I’ve reviewed half a dozen open source machine learning and/or deep learning frameworks: Caffe, Microsoft Cognitive Toolkit (aka CNTK 2), MXNet, Scikit-learn, Spark MLlib, and TensorFlow. If I had cast my net even wider, I might well have covered a few other popular frameworks, including Theano (a 10-year-old Python deep learning and machine learning framework), Keras (a deep learning front end for Theano and TensorFlow), and DeepLearning4j (deep learning software for Java and Scala on Hadoop and Spark). If you’re interested in working with machine learning and neural networks, you’ve never had a richer array of options.  

There’s a difference between a machine learning framework and a deep learning framework. Essentially, a machine learning framework covers a variety of learning methods for classification, regression, clustering, anomaly detection, and data preparation, and it may or may not include neural network methods. A deep learning or deep neural network (DNN) framework covers a variety of neural network topologies with many hidden layers. These layers comprise a multistep process of pattern recognition. The more layers in the network, the more complex the features that can be extracted for clustering and classification.
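
To make the distinction concrete, here is a toy sketch that solves the same classification problem twice: once with a classical scikit-learn estimator (no neural network involved) and once with a small Keras network whose stacked hidden layers perform the multistep feature extraction described above. The layer sizes and epoch count are arbitrary choices for illustration, not recommendations.

    from sklearn.datasets import make_classification
    from sklearn.linear_model import LogisticRegression
    from keras.models import Sequential
    from keras.layers import Dense

    X, y = make_classification(n_samples=2000, n_features=20, random_state=0)

    # Machine learning framework: a classifier with no neural network in sight.
    clf = LogisticRegression().fit(X, y)
    print("logistic regression accuracy:", clf.score(X, y))

    # Deep learning framework: a stack of hidden layers extracting progressively richer features.
    net = Sequential([
        Dense(64, activation="relu", input_shape=(20,)),
        Dense(32, activation="relu"),
        Dense(1, activation="sigmoid"),
    ])
    net.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    net.fit(X, y, epochs=5, verbose=0)
    print("deep network accuracy:", net.evaluate(X, y, verbose=0)[1])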

Source: InfoWorld Big Data

SAP adds new enterprise information management

SAP yesterday renewed its enterprise information management (EIM) portfolio with a series of updates aimed at helping organizations better manage, govern and strategically use and control their data assets.

“By effectively managing enterprise data to deliver trusted, complete and relevant information, organizations can ensure data is always actionable to gain business insight and drive innovation,” says Philip On, vice president of Product Marketing at SAP.

The additions to the EIM portfolio are intended to provide customers with enhanced support and connectivity for big data sources, improved data stewardship and metadata management capabilities and a pay-as-you-go cloud data quality service, he adds.

The updates to the EIM portfolio include the following features:

  • SAP Data Services. Providing extended support and connectivity for integrating and loading large and diverse data types, SAP Data Services includes a data extraction capability for fast data transfer from Google BigQuery to data processing systems like Hadoop, SAP HANA Vora, SAP IQ, SAP HANA and other cloud storage. Other enhancements include optimized data extraction from a Hive table using Spark and new connectivity support for Amazon Redshift and Apache Cassandra.
  • SAP Information Steward. The latest version helps speed the resolution of data issues with better usability and improved policy and workflow processes. You can immediately view and share data quality scorecards across devices without having to log into the application. You can also more easily access information policies while viewing rules, scorecards, metadata and terms to immediately verify compliance. New information policy web services allow policies outside of the application to be viewed anywhere, such as in corporate portals. Finally, new and enhanced metadata management capabilities give data stewards and IT users a way to quickly search metadata and conduct more meaningful metadata discovery.
  • SAP Agile Data Preparation. To improve collaboration between business users and data stewards, SAP Agile Data Preparation focuses on the bridge between agile business data mash-ups and central corporate governance. It allows you to share, export and import rules between different worksheets or between different data domains. The rules are shared through a central, managed repository as well as through the capability to import or export them using flat files. New data remediation capabilities let you change the values of a given cell by double-clicking it, add a new column and populate it with relevant data values, or add or remove records in a single action.
  • SAP HANA smart data integration and smart data quality. The latest release of the SAP HANA platform features new performance and connectivity functionality to deliver faster, more robust real-time replication, bulk/batch data movement, data virtualization and data quality through one common user interface.
  • SAP Data Quality Management microservices. This new cloud-based offering is available as a beta on SAP HANA Cloud Platform, developer edition. It’s a pay-as-you-go cloud-based service that ensures clean data by providing data validation and enrichment for addresses and geocodes within any application or environment.

“As organizations are moving to the cloud and digital business, the data foundation is so important,” On says. “It’s not just having the data, but having the right data. We want to give them a suite of solutions that truly allow them to deliver information excellence from the beginning to the end.”

On says SAP Data Quality Management microservices will be available later in the first quarter. The other offerings are all immediately available.

This story, “SAP adds new enterprise information management” was originally published by CIO.

Source: InfoWorld Big Data

Hadoop vendors make a jumble of security

A year ago a Deutsche Bank survey of CIOs found that “CIOs are now broadly comfortable with [Hadoop] and see it as a significant part of the future data architecture.” They’re so comfortable, in fact, that many CIOs haven’t thought to question Hadoop’s built-in security, leading Gartner analyst Merv Adrian to query, “Can it be that people believe Hadoop is secure? Because it certainly is not.”

That was then, this is now, and the primary Hadoop vendors are getting serious about security. That’s the good news. The bad, however, is that they’re approaching Hadoop security in significantly different ways, which promises to turn big data’s open source poster child into a potential pitfall for vendor lock-in.

Can’t we all get along?

That’s the conclusion reached in a Gartner research note authored by Adrian. As he writes, “Hadoop security stacks emerging from three independent distributors remain immature and are not comprehensive; they are therefore likely to create incompatible, inflexible deployments and promote vendor lock-in.” This is, of course, standard operating procedure in databases or data warehouses, but it calls into question some of the benefit of building on an open source “standard” like Hadoop.

Ironically, it’s the very openness of Hadoop that creates this proprietary potential.

It starts with the inherent insecurity of Hadoop, which has come to light with recent ransomware attacks. Hadoop hasn’t traditionally come with built-in security, yet Hadoop systems “increase utilization of file system-based data that is not otherwise protected,” as Adrian explains, allowing “new vulnerabilities [to] emerge that compromise carefully crafted data security regimes.” It gets worse.

Organizations are increasingly turning to Hadoop to create “data lakes.” Unlike databases, which Adrian says tend to contain “known data that conforms to predetermined policies about quality, ownership, and standards,” data lakes encourage data of indeterminate quality or provenance. Though the Hadoop community has promising projects like Apache Eagle (which uses machine intelligence to identify security threats to Hadoop clusters), it has yet to offer a unified solution to lock down such data and, worse, is serving up a mishmash of competing alternatives, as Adrian describes.

Big data security, in short, is a big mess.

Love that lock-in

The specter of lock-in is real, but is it scary? I’ve argued before that lock-in is a fact of enterprise IT, made no better (or worse) by open source … or cloud or any other trend in IT. Once an enterprise has invested money, people, and other resources into making a system work, it’s effectively locked in.

Still, there’s arguably more at stake when a company puts petabytes of data into a Hadoop data lake versus running an open source content management system or even an operating system. The heart of any business is its data, and getting boxed into a particular Hadoop vendor because an enterprise becomes dependent on that vendor’s approach to securing Hadoop clusters seems like a big deal.

But is it really?

Oracle, after all, makes billions of dollars “locking in” customers to its very proprietary database, so much so that it had double the market share (41.6 percent) of its nearest competitor (Microsoft at 19.4 percent) as of April 2016, according to Gartner’s research. If enterprises are worried about lock-in, they have a weird way of showing it.

For me the bigger issue isn’t lock-in, but rather that the competing approaches to Hadoop security may actually yield poorer security, at least in the short term. The enterprises that deploy more than one Hadoop stack (a common occurrence) will need to juggle the conflicting security approaches and almost certainly leave holes. Those that standardize on one vendor will be stuck with incomplete security solutions.

Over time, this will improve. There’s simply too much money at stake for the on-prem and cloud-based Hadoop vendors. But for the moment, enterprises should continue to worry about Hadoop security.

Source: InfoWorld Big Data

Apache Eagle keeps an eye on big data usage

Apache Eagle, originally developed at eBay and then donated to the Apache Software Foundation, fills a big data security niche that remains thinly populated, if not bare: It sniffs out possible security and performance issues with big data frameworks.

To do this, Eagle uses other Apache open source components, such as Kafka, Spark, and Storm, to generate and analyze machine learning models from the behavioral data of big data clusters.

Looking in from the inside

Data for Eagle can come from activity logs for various data sources (HDFS, Hive, MapR FS, Cassandra, etc.) or from performance metrics harvested directly from frameworks like Spark. The data can then be piped by the Kafka streaming framework into a real-time detection system built with Apache Storm or into a model-training system built on Apache Spark. The former generates alerts and reports based on existing policies; the latter creates machine learning models to drive new policies.

This emphasis on real-time behavior tops the list of “key qualities” in the documentation for Eagle. It’s followed by “scalability,” “metadata driven” (meaning changes to policies are deployed automatically when their metadata is changed), and “extensibility.” This last means the data sources, alerting systems, and policy engines used by Eagle are supplied by plugins and aren’t limited to what’s in the box.

Because Eagle’s been put together from existing parts of the Hadoop world, it has two theoretical advantages. One, there’s less reinvention of the wheel. Two, those who already have experience with the pieces in question will have a leg up.
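
As a rough illustration of the front of that pipeline (getting events into Kafka for the Storm-based detection layer to consume), here is a minimal Python sketch using the kafka-python client. The topic name and event fields are assumptions for illustration; Eagle ships with its own collectors, topics, and schemas.

    import json
    from kafka import KafkaProducer

    producer = KafkaProducer(
        bootstrap_servers="localhost:9092",
        value_serializer=lambda e: json.dumps(e).encode("utf-8"),
    )

    # A hypothetical HDFS audit-log event of the kind a detection policy might flag.
    event = {
        "timestamp": "2017-02-15T12:00:00Z",
        "user": "alice",
        "cmd": "delete",
        "src": "/warehouse/sensitive/table1",
    }
    producer.send("hdfs_audit_log", event)   # hypothetical topic name
    producer.flush()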

What are my people up to?

Aside from the above-mentioned use cases like analyzing job performance and monitoring for anomalous behavior, Eagle can also analyze user behaviors. This isn’t about, say, analyzing data from a web application to learn about the public users of that app, but rather the users of the big data framework itself — the folks building and managing the Hadoop or Spark back end. An example of how to run such analysis is included, and it could be deployed as-is or modified.

Eagle also allows application data access to be classified according to levels of sensitivity. Only HDFS, Hive, and HBase applications can make use of this feature right now, but its interaction with them provides a model for how other data sources could also be classified.

Let’s keep this under control

Because big data frameworks are fast-moving creations, it’s been tough to build reliable security around them. Eagle’s premise is that it can provide policy-based analysis and alerting as a possible complement to other projects like Apache Ranger. Ranger provides authentication and access control across Hadoop and its related technologies; Eagle gives you some idea of what people are doing once they’re allowed inside.

The biggest question hovering over Eagle’s future — yes, even this early on — is to what degree Hadoop vendors will elegantly roll it into their existing distributions, or use their own security offerings. Data security and governance have long been among the missing pieces that commercial offerings could compete on.

Source: InfoWorld Big Data