7 big data tools to ditch in 2017

We’ve been on this big data adventure for a while, and not everything is shiny and new anymore. In fact, some technologies may be holding you back. Remember, this is the fastest-moving area of enterprise tech — so much so that some software acts as a placeholder until better bits arrive.

Those upgrades — or replacements — can make the difference between a successful big data initiative and one you’ll be living down for the next few years. Here are some elements of the stack you should start thinking about replacing:

1. MapReduce. MapReduce is slow. It’s rarely the best way to go about a problem. There are other execution models to choose from — the most common is the DAG (directed acyclic graph), of which MapReduce can be considered a simple subset. If you’ve built a bunch of custom MapReduce jobs, the performance difference compared to Spark is worth the cost and trouble of switching (see the sketch after this list).

2. Storm. I’m not saying Spark will eat the streaming world, although it might, but with technologies like Apex and Flink there are better, lower-latency alternatives than Storm. Besides, you should evaluate your latency tolerance and whether the bugs in your lower-level, more complicated code are worth a few extra milliseconds. Storm also doesn’t have the support it could, with Hortonworks as the only real backer — and with Hortonworks facing increasing market pressure, Storm is unlikely to get more attention.

3. Pig. Pig kind of blows. You can do anything it does with Spark or other technologies. At first Pig seems like a nice “PL/SQL for big data,” but you quickly find out it’s a little bizarre.

4. Java. No, not the JVM, but the language. The syntax is clunky for big data jobs, and newer constructs like lambda expressions have been bolted onto the side in a somewhat awkward manner. The big data world has largely moved to Scala and Python (the latter when you can afford the performance hit and need Python libraries, or are infested with Python developers). Of course, you can use R for stats, until you rewrite it in Python because R doesn’t have all the fun scale features.

5. Tez. This is another Hortonworks pet project. It’s a DAG implementation, but one that a Tez developer has described as being like writing in “assembly language.” At the moment, with a Hortonworks distribution, you’ll end up using Tez behind Hive and other tools — but you can already use Spark as the engine in other distributions. Tez has always been kind of buggy anyhow. Again, this is one vendor’s project: It doesn’t have the industry or community support of other technologies, and it has no runaway advantages over other solutions. This is an engine I’d look to consolidate out.

6. Oozie. I’ve long hated on Oozie. It isn’t much of a workflow engine or much of a scheduler — yet it’s both and neither at the same time! It is, however, a collection of bugs for a piece of software that shouldn’t be that hard to write. Between StreamSets, DAG implementations, and all, you should have ways to do most of what Oozie does.

7. Flume. Between StreamSets and Kafka and other solutions, you probably have an alternative to Flume. That May 20, 2015, release is looking a bit rusty. You can track the year-on-year activity level. Hearts and minds have left. It’s probably time to move on.
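
To make the MapReduce-versus-Spark point in item 1 concrete, here is a minimal word-count sketch in PySpark. It is only a sketch: it assumes a working Spark install, and the input path is hypothetical. In classic MapReduce, the final sort-by-count would mean chaining a second job with its own disk round-trip; in Spark it is just one more stage in the same DAG.

```python
# A minimal PySpark word count; a sketch, assuming a working Spark
# install. The input path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
    .flatMap(lambda line: line.split())   # the "map" phase
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)      # the "reduce" phase
    .sortBy(lambda pair: -pair[1])        # extra stage, not a second job
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```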

Maybe by 2018 …

What’s left? Some technology is showing its age, but fully viable alternatives have not arrived yet. Think ahead about replacing these:

1. Hive. This is overly snarky, but Hive is like the least performant distributed database on the planet. If we hadn’t as an industry decided RDBMSes were the greatest thing since sliced bread for like 40 years, then would we really have created this monster?

2. HDFS. Writing a system-level service in Java is not the greatest of ideas, and Java’s memory management makes pushing massive amounts of bytes around a bit slow. The way the HDFS NameNode works is not ideal for anything and constitutes a bottleneck. Various vendors have workarounds to make this better, but honestly, nicer things are available. There are other distributed file systems: MapR-FS is a pretty well-designed one, and there’s also Gluster and a slew of others.

Your gripes here

With an eye to the future, it’s time to cull the herd of technologies that looked promising but have grown either obsolete or rusty. This is my list. What else should I add?

8 'new' enterprise products we don't want to see

I get a lot of press releases. Most of them are from startups with the same old enterprise product ideas under different names. Some are for “new” products from existing companies (by “new,” I mean new implementations of old ideas).

Think you have a great idea? Please tell me it isn’t one of these:

1. A column family or key-value store database

You have a brand-new take on how to store data, and it starts with keys associated with something. It’s revolutionary because blah blah blah.

No — stop it. Don’t start any more of these; don’t fund any more of these. That ship has sailed; the market is beyond saturated.

2. ETL/monitoring/data catalogs

The market might bear a totally new approach, but I’ve yet to see one (I mean actually a new approach, not simply saying that). I recently watched a vendor drone on for more than an hour before telling us what it was pitching. The more times a vendor says “revolutionary,” the more you know the only thing that’s new is the pricing. It’s an ETL tool with a catalog and monitoring that only works with their cloud, but they support open source and community! Sad, man.

Seriously, you can’t dress up your ETL/governance tool as a brand-new product idea — you’ve just reinvented Informatica. I’m not saying you should use Informatica (I’d never say that), but I am saying “Zzzz, don’t start another one.” If you’re a big enough vendor to build your own, that’s nice, but no one cares.

3. On-prem clouds

OpenShift, Cloud Foundry, and so on have all become “new and interesting ways to manage Docker or Docker images.” We also say “hybrid” because if you try hard you might get it up to Amazon, but the tools for doing that will certainly suck. Frankly, I’m skeptical that the “hybrid cloud” is anything but a silly marketing gimmick in terms of practicality, implementation, or utility.

4. Hadoop/Spark management with performance enhancements

Management in this area is a real problem, but if you’re starting now, you’re late to the game. This is a niche market. [Disclosure: I’m an adviser for one of these.]

5. Generic data visualization tool

In truth, I’m not superhappy with any product in this area (Tableau in particular sucks). This is a market that has had 1,000 false starts along with a handful of good players that charge too much. Amazon and others are getting into this game as well, although I’m dubious anyone wants to pay by the cycle to draw a chart. Anyhow, the usefulness of these tools will fade as we move to more automated decision-making tools.

6. Content management systems by any other name

People are still writing me about how they started these things. They have new names for them — but no, I’m not writing about them. If I covered consumer electronics I probably wouldn’t write about the various toasters you can buy at Target either. Are you people joking?

7. Another streaming tool

Between Kafka, Spark, Apex, Storm, and so on, whatever you need in big data software is covered. Your “revolutionary” new way to stream is probably not new.

8. Server-side blah blah with mobile added

Yes, mobile exists, but with maybe one or two exceptions, a mobile app is mainly a client to the server, like a web browser. If this means you added sync or notifications to your existing product, cool. If you launched a new product line with “mobile” in the name, please go sell it to journalists and analysts with no technical background.

If you’re about to build any of those, please stop. Don’t tell anyone about it. Walk away from the keyboard before you bore someone.

Bossie Awards 2016: The best open source big data tools

Elasticsearch, based on the Apache Lucene engine, is an open source distributed search engine that focuses on modern concepts like REST APIs and JSON documents. Its approach to scaling makes it easy to take Elasticsearch clusters from gigabytes to petabytes of data with low operational overhead.
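
To show what “REST APIs and JSON documents” means in practice, here is a minimal sketch that indexes and searches a document over plain HTTP using Python’s requests library. It assumes a recent Elasticsearch node on the default local port; the index and field names are hypothetical.

```python
# A sketch of Elasticsearch's REST/JSON interface, assuming a recent
# local node at localhost:9200. Index and field names are hypothetical.
import requests

ES = "http://localhost:9200"

# Index a JSON document; the index is created on first write.
requests.post(f"{ES}/logs/_doc", json={
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
})

# Force a refresh so the document is searchable right away.
requests.post(f"{ES}/logs/_refresh")

# Full-text search, also expressed as a JSON body.
resp = requests.get(f"{ES}/logs/_search", json={
    "query": {"match": {"message": "timeout"}},
})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```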

As part of the ELK stack (Elasticsearch, Logstash, and Kibana, all developed by Elasticsearch’s creators, Elastic), Elasticsearch has found its killer app as an open source Splunk replacement for log analysis. Companies like Netflix, Facebook, Microsoft, and LinkedIn run large Elasticsearch clusters for their logging infrastructure. Furthermore, the ELK stack is finding its way into other domains, such as fraud detection and domain-specific business analytics, spreading the use of Elasticsearch throughout the enterprise.

— Ian Pointer

Big data problem? Don't forget search

With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Looking through a bazillion records for a few million marked with a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).

For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr and Elasticsearch (and commercial platforms built on them, such as Lucidworks), you have a powerful way to optimize your I/O and personalize your user experience. It’s much better than holding fancy new tools from the wrong end.

A bad job for Spark

Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their pattern was a simple string search and a drill-down. It was beyond the capabilities of the database to do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.
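
For a pattern like that one, a simple string search plus a drill-down, a search index does the heavy lifting at write time instead of scan time. Here is a sketch of the same query expressed against Elasticsearch rather than a full Spark scan; the index and field names are hypothetical, and it assumes a node on the default port.

```python
# A sketch of string search plus drill-down as one Elasticsearch query
# instead of a full Spark scan. Index and field names are hypothetical.
import requests

ES = "http://localhost:9200"

resp = requests.get(f"{ES}/events/_search", json={
    "query": {"match": {"description": "connection refused"}},  # string search
    "aggs": {  # the drill-down: bucket matching docs by region
        "by_region": {"terms": {"field": "region.keyword"}}
    },
    "size": 5,  # a handful of raw hits; the buckets carry the summary
})

body = resp.json()
print("total matches:", body["hits"]["total"])
for bucket in body["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```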

Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in-memory analytics are only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.

With big data, CEOs find garbage in is still garbage out

Another day, another CEO survey. This one, from KPMG, finds that CEOs don’t trust their analytics, the way their team is using or implementing them, or even the data used to make decisions in the first place. In fact, only 31 percent of respondents see their organizations as leaders in the use of data and analytics.

You have to ask: What’s the CEO’s culpability in all that?

Despite the evidence that math makes better decisions than gut calls, many companies haven’t gotten there yet, and the approach many are taking is still the wrong one: Deploying big data infrastructure with no plan and no use cases will go the way of any IT project with no destination in mind.

What’s funny about KPMG’s survey is that despite the lack of trust in both the analytics and the data, a set of very modern concerns emerges: customer loyalty, understanding millennials, projecting the relevance of current products and services, and understanding customer needs and expectations. You know what you need to do to achieve those things? Fix how you collect data and perform analytics. You know who should really push for that and view themselves as leading that charge? The CEO of any decent company (aided and abetted by a CIO, CDO, CTO, and so on).

How to get your mainframe's data for Hadoop analytics

Many so-called big data — really, Hadoop — projects have patterns. Many are merely enterprise integration patterns that have been refactored and rebranded. Of those, the most common is the mainframe pattern.

Because most organizations run the mainframe and its software as a giant single point of failure, the mainframe team hates everyone. Its members hate change, and they don’t want to give you access to anything. However, there is a lot of data on that mainframe, and, if approached gently, the mainframe team would rather see people learn to use the existing system than start from scratch. After all, the company has only begun to scratch the surface of what the mainframe and the existing system have available.

Many great data integration techniques can’t be used in an environment where new software installs are highly discouraged, as is the case with the mainframe pattern. Rest assured, however, that there are plenty of techniques to work around these limitations.

Sometimes the goal of mainframe-to-Hadoop or mainframe-to-Spark projects is just to look at the current state of the world. More frequently, though, the goal is trend analysis and tracking changes in a way the existing system doesn’t, which requires the set of techniques known as change data capture (CDC).
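
CDC comes in several flavors (log-based, trigger-based, snapshot differencing). When you can’t install anything on the mainframe, differencing periodic extracts is often the only option left. Here is a minimal sketch of that idea, with a hypothetical record layout:

```python
# A minimal snapshot-differencing CDC sketch: compare two keyed extracts
# and emit change events. The record layout is hypothetical.
def diff_snapshots(old: dict, new: dict):
    """Yield (op, key, record) change events between two snapshots."""
    for key, record in new.items():
        if key not in old:
            yield ("insert", key, record)
        elif old[key] != record:
            yield ("update", key, record)
    for key in old.keys() - new.keys():
        yield ("delete", key, old[key])

yesterday = {"0001": {"balance": 100}, "0002": {"balance": 250}}
today     = {"0001": {"balance": 120}, "0003": {"balance": 75}}

for op, key, record in diff_snapshots(yesterday, today):
    print(op, key, record)  # update 0001, insert 0003, delete 0002
```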

HDFS: Big data analytics' weakest link

For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark you need to pull a lot of data into memory very quickly. Having a file system that supports high burst rates — up to network saturation — is a good thing. However, Hadoop’s eponymous file system (Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be.

What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and which file that block belongs to. (One implementation is a file allocation table, or FAT, of sorts.) In a distributed file system, the blocks are “distributed” among disks attached to multiple computers. Additionally, like RAID or most SAN systems, the blocks are replicated so that if a node is lost from the network, no data is lost.
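
As a toy illustration of that mapping, here is a sketch of a “file allocation table” for a distributed file system: a file split into fixed-size blocks, each block replicated across several nodes. This is purely illustrative; it is not how HDFS or any real DFS is implemented.

```python
# A toy model of block placement in a distributed file system: split a
# file into blocks and replicate each block across distinct nodes.
# Purely illustrative; not how HDFS or any real DFS works internally.
import itertools

BLOCK_SIZE = 4   # bytes; absurdly small, for illustration only
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data: bytes) -> dict:
    """Return a table mapping block index -> (block bytes, replica nodes)."""
    table = {}
    node_cycle = itertools.cycle(NODES)
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        table[i // BLOCK_SIZE] = (block, replicas)
    return table

for idx, (block, replicas) in place_blocks(b"hello distributed world").items():
    print(idx, block, replicas)
```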

What’s wrong with HDFS?

In HDFS, the role of the “file allocation table” is taken by the namenode. You can have more than one namenode (for redundancy), but essentially the namenode is both a point of failure and a bottleneck. A namenode can fail over, but that takes time; it also means keeping the namenodes in sync, which introduces more latency. On top of that, HDFS does a fair amount of threading and locking, and it is written in garbage-collected Java. Garbage collection — especially Java garbage collection — requires a lot of memory (generally at least 10x to be as efficient as native memory management).

Moreover, in developing applications for distributed computing, we often figure that whatever inefficiency we inject through language choice will be outweighed by I/O: So what if it takes 1,000 operations to open a file and hand over some data, when a single I/O operation takes 10x that long? Simplistically speaking, the higher-level the language, the more operations or “work” are executed per line of code.

5 big data sources for strategic sentiment analysis

Somewhere, someone is tweeting “[This airline] sucks the big one!” In the past, they would have been ignored. These days many airlines respond with sympathy (“We’re so sorry you’re having a rough trip — please DM us, so we can resolve it”) or send an invitation to call an 800-number (where you can wait on hold forever).

A technique called sentiment analysis, the mathematical categorization of statements’ negative or positive connotations, gives companies powerful ways to analyze aggregate language data across all sorts of communications, not only tweets. There’s real value in measuring sentiment inside and outside your company. Here are five of the most valuable sentiment sources to tap.
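
At its simplest, that categorization can be sketched as lexicon scoring: count the positive and negative words in a statement and normalize. Production systems use trained models, but a toy version (with made-up word lists) shows the idea:

```python
# A toy lexicon-based sentiment scorer. Real systems use trained models;
# this just shows the basic idea of scoring a statement's connotation.
POSITIVE = {"great", "love", "helpful", "resolved", "thanks"}
NEGATIVE = {"sucks", "terrible", "broken", "angry", "waiting"}

def sentiment(text: str) -> float:
    """Return a score in [-1, 1]; below zero means negative sentiment."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(sentiment("This airline sucks the big one!"))          # -1.0
print(sentiment("Thanks, the agent was great and helpful"))  #  1.0
```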

Customer inquiries

When a customer asks about your product or services, metrics on overall sentiment, the length of the message, and words used can be compared to past inquiries. Different inquiries warrant different treatment.

Customer service

When a customer writes in about a problem, is he or she really upset or simply asking, “Hi, can you look into this?” Sentiment analysis of these interactions helps track the way customers feel about your company or product over time. Is your relationship solid? When interacting with an inexperienced operator, do customers walk away satisfied?

We have the big data tools — let's learn to use them

Recently, at the Apache Spark Maker Community event in San Francisco, I was on a panel and feeling a bit salty. It seems many people have prematurely declared victory in the data game. A few people have achieved self-service, and even more have claimed to.

In truth, this is a tiny minority — and most of those people have achieved cargo-cult datacentricity. They use Hadoop and/or Spark and pull data into Excel, manipulate it, and paste it into PowerPoint. Maybe they’ve added Tableau and are able to make prettier charts, but what really has changed? Jack, that’s what.

Self-service is only step one on this trip to data-driven decision-making. Companies need to know their data before they can consider their choices — but this is still very much data at the edges with a meat cloud in the center.

So far, we use computer-aided decision-making and computer-driven processes only where we have to: advanced fraud detection, algorithmic trading, and rigorously regulated processes (such as Obamacare). Generally, we don’t use them elsewhere.

OK computer: When pop music meets machine learning

It’s Moogfest season here in Durham, so there’s been a lot of discussion in the office around music, data lakes, and the heat map we’re building for the festival. But the conversation took a different turn, thanks to a tweet.

Many months ago when I was at IBM Insight, I tweeted a snide remark about computer-generated jokes. Fast-forward to this week, when former “Monk” and Letterman writer Joe Toplyn responded with a link “proving” that computers could generate jokes that were funny … at least to the easily amused. Amid the discussion, someone drove by playing crappy autotune pop music.

This got me thinking about whether you could generate hit pop songs. Most of the popular songs are written by two middle-aged guys from Sweden anyhow. Plus, there are algorithms that can detect which songs are likely to be a hit. While the current hit-song generator merely pairs song titles with performers, we also have an algorithm that can generate tweets for the presumptive Republican presidential nominee. It seems like a short trip from hit detector to factory songwriting to neural net for political speech to full-on pop song generator!

We’d need parameters like a genre (pop, hip-hop, dance) and probably gender, as well as whether it’s a party track, a love song, happy, sad, angry, and so on. Then maybe we’d train a neural net on the corpus of songs by the two Swedes. Add that to an adaptation of the hit detection algorithm and you should have not a great song, but at the very least a popular one.