7 big data tools to ditch in 2017

We’ve been on this big data adventure for a while, and not everything is shiny and new anymore. In fact, some technologies may be holding you back. Remember, this is the fastest-moving area of enterprise tech — so much so that some software acts as a placeholder until better bits arrive.

Those upgrades — or replacements — can make the difference between a successful big data initiative and one you’ll be living down for the next few years. Here are some elements of the stack you should start thinking about replacing:

1. MapReduce. MapReduce is slow. It’s rarely the best way to go about a problem. There are other execution models to choose from — the most common is the DAG (directed acyclic graph), of which MapReduce can be considered a simple subset. If you’ve built a bunch of custom MapReduce jobs, the performance difference compared to Spark is worth the cost and trouble of switching (see the sketch after this list).

2. Storm. I’m not saying Spark will eat the streaming world, although it might, but with technologies like Apex and Flink there are better, lower-latency alternatives than Storm. Besides, you should evaluate your latency tolerance and whether the bugs in your lower-level, more complicated code are worth a few extra milliseconds. Storm also doesn’t have the support it could, with Hortonworks as the only real backer — and with Hortonworks facing increasing market pressure, Storm is unlikely to get more attention.

3. Pig. Pig kind of blows. You can do anything it does with Spark or other technologies. At first Pig seems like a nice “PL/SQL for big data,” but you quickly find out it’s a little bizarre.

4. Java. No, not the JVM, but the language. The syntax is clunky for big data jobs, and newer constructs like lambda expressions have been bolted onto the side in a somewhat awkward manner. The big data world has largely moved to Scala and Python (the latter when you can afford the performance hit and need Python libraries, or are infested with Python developers). Of course, you can use R for stats, until you rewrite it in Python because R doesn’t have all the fun scale features.

5. Tez. This is another Hortonworks pet project. It’s a DAG implementation, but one that a Tez developer has described as being like writing in “assembly language.” At the moment, with a Hortonworks distribution, you’ll end up using Tez behind Hive and other tools — but you can already use Spark as the engine in other distributions. Tez has always been kind of buggy anyhow. Again, this is one vendor’s project: It doesn’t have the industry or community support of other technologies, and it has no runaway advantages over other solutions. This is an engine I’d look to consolidate out.

6. Oozie. I’ve long hated on Oozie. It isn’t much of a workflow engine or much of a scheduler — yet it’s both and neither at the same time! It is, however, a collection of bugs for a piece of software that shouldn’t be that hard to write. Between StreamSets, DAG implementations, and all, you should have ways to do most of what Oozie does.

7. Flume. Between StreamSets and Kafka and other solutions, you probably have an alternative to Flume. That May 20, 2015, release is looking a bit rusty. You can track the year-on-year activity level. Hearts and minds have left. It’s probably time to move on.
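
To make the MapReduce-versus-Spark point in item 1 concrete, here is a minimal word-count sketch in PySpark. It is only a sketch: it assumes a working Spark install, and the input path is hypothetical. In classic MapReduce, the final sort-by-count would mean chaining a second job with its own disk round-trip; in Spark it is just one more stage in the same DAG.

```python
# A minimal PySpark word count; a sketch, assuming a working Spark
# install. The input path is hypothetical.
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("wordcount-sketch").getOrCreate()

counts = (
    spark.sparkContext.textFile("hdfs:///data/input.txt")  # hypothetical path
    .flatMap(lambda line: line.split())   # the "map" phase
    .map(lambda word: (word, 1))
    .reduceByKey(lambda a, b: a + b)      # the "reduce" phase
    .sortBy(lambda pair: -pair[1])        # extra stage, not a second job
)

for word, count in counts.take(10):
    print(word, count)

spark.stop()
```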

Maybe by 2018 …

What’s left? Some technology is showing its age, but fully viable alternatives have not arrived yet. Think ahead about replacing these:

1. Hive. This is overly snarky, but Hive is like the least performant distributed database on the planet. If we hadn’t as an industry decided RDBMSes were the greatest thing since sliced bread for like 40 years, then would we really have created this monster?

2. HDFS. Writing a system-level service in Java is not the greatest of ideas, and Java’s memory management makes pushing massive amounts of bytes around a bit slow. The way the HDFS NameNode works is not ideal for anything and constitutes a bottleneck. Various vendors have workarounds to make this better, but honestly, nicer things are available. There are other distributed file systems: MapR-FS is a pretty well-designed one, and there’s also Gluster and a slew of others.

Your gripes here

With an eye to the future, it’s time to cull the herd of technologies that looked promising but have grown either obsolete or rusty. This is my list. What else should I add?

8 'new' enterprise products we don't want to see

I get a lot of press releases. Most of them are from startups with the same old enterprise product ideas under different names. Some are for “new” products from existing companies (by “new,” I mean new implementations of old ideas).

Think you have a great idea? Please tell me it isn’t one of these:

1. A column family or key-value store database

You have a brand-new take on how to store data, and it starts with keys associated with something. It’s revolutionary because blah blah blah.

No — stop it. Don’t start any more of these; don’t fund any more of these. That ship has sailed; the market is beyond saturated.

2. ETL/monitoring/data catalogs

The market might bear a totally new approach, but I’ve yet to see one (I mean actually a new approach, not simply saying that). I recently watched a vendor drone on for more than an hour before telling us what it was pitching. The more times a vendor says “revolutionary,” the more you know the only thing that’s new is the pricing. It’s an ETL tool with a catalog and monitoring that only works with their cloud, but they support open source and community! Sad, man.

Seriously, you can’t dress up your ETL/governance tool as a brand-new product idea — you’ve just reinvented Informatica. I’m not saying you should use Informatica (I’d never say that), but I am saying “Zzzz, don’t start another one.” If you’re a big enough vendor to build your own, that’s nice, but no one cares.

3. On-prem clouds

OpenShift, Cloud Foundry, and so on have all become “new and interesting ways to manage Docker or Docker images.” We also say “hybrid” because if you try hard you might get it up to Amazon, but the tools for doing that will certainly suck. Frankly, I’m skeptical that the “hybrid cloud” is anything but a silly marketing gimmick in terms of practicality, implementation, or utility.

4. Hadoop/Spark management with performance enhancements

Management in this area is a real problem, but if you’re starting now, you’re late to the game. This is a niche market. [Disclosure: I’m an adviser for one of these.]

5. Generic data visualization tool

In truth, I’m not superhappy with any product in this area (Tableau in particular sucks). This is a market that has had 1,000 false starts along with a handful of good players that charge too much. Amazon and others are getting into this game as well, although I’m dubious anyone wants to pay by the cycle to draw a chart. Anyhow, the usefulness of these tools will fade as we move to more automated decision-making tools.

6. Content management systems by any other name

People are still writing me about how they started these things. They have new names for them — but no, I’m not writing about them. If I covered consumer electronics I probably wouldn’t write about the various toasters you can buy at Target either. Are you people joking?

7. Another streaming tool

Between Kafka, Spark, Apex, Storm, and so on, whatever you need in big data software is covered. Your “revolutionary” new way to stream is probably not new.

8. Server-side blah blah with mobile added

Yes, mobile exists, but with maybe one or two exceptions, a mobile app is mainly a client to the server, like a web browser. If this means you added sync or notifications to your existing product, cool. If you launched a new product line with “mobile” in the name, please go sell it to journalists and analysts with no technical background.

If you’re about to build any of those, please stop. Don’t tell anyone about it. Walk away from the keyboard before you bore someone.

Bossie Awards 2016: The best open source big data tools

Elasticsearch, based on the Apache Lucene engine, is an open source distributed search engine that focuses on modern concepts like REST APIs and JSON documents. Its approach to scaling makes it easy to take Elasticsearch clusters from gigabytes to petabytes of data with low operational overhead.
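
To show what “REST APIs and JSON documents” means in practice, here is a minimal sketch that indexes and searches a document over plain HTTP using Python’s requests library. It assumes a recent Elasticsearch node on the default local port; the index and field names are hypothetical.

```python
# A sketch of Elasticsearch's REST/JSON interface, assuming a recent
# local node at localhost:9200. Index and field names are hypothetical.
import requests

ES = "http://localhost:9200"

# Index a JSON document; the index is created on first write.
requests.post(f"{ES}/logs/_doc", json={
    "service": "checkout",
    "level": "ERROR",
    "message": "payment gateway timeout",
})

# Force a refresh so the document is searchable right away.
requests.post(f"{ES}/logs/_refresh")

# Full-text search, also expressed as a JSON body.
resp = requests.get(f"{ES}/logs/_search", json={
    "query": {"match": {"message": "timeout"}},
})
for hit in resp.json()["hits"]["hits"]:
    print(hit["_source"])
```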

As part of the ELK stack (Elasticsearch, Logstash, and Kibana, all developed by Elasticsearch’s creators, Elastic), Elasticsearch has found its killer app as an open source Splunk replacement for log analysis. Companies like Netflix, Facebook, Microsoft, and LinkedIn run large Elasticsearch clusters for their logging infrastructure. Furthermore, the ELK stack is finding its way into other domains, such as fraud detection and domain-specific business analytics, spreading the use of Elasticsearch throughout the enterprise.

— Ian Pointer

Big data problem? Don't forget search

With every cool new technology, people get overly infatuated and start using it for the wrong things. For example: Looking through a bazillion records for a few million marked with a set of criteria is a rather stupid use of MapReduce or your favorite DAG implementation (see: Spark).

For that and similar tasks, don’t forget the original big data technology: search. With great open source tools like Solr and Elasticsearch (and commercial platforms built on them, such as Lucidworks), you have a powerful way to optimize your I/O and personalize your user experience. It’s much better than holding fancy new tools from the wrong end.

A bad job for Spark

Not long ago a client asked me how to use Spark to search through a bunch of data they’d streamed into a NoSQL database. The trouble was that their pattern was a simple string search and a drill-down. It was beyond the capabilities of the database to do efficiently: They would have to pull all the data out of storage and parse through it in memory. Even with a DAG it was a little slow (not to mention expensive) on AWS.
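
For a pattern like that one, a simple string search plus a drill-down, a search index does the heavy lifting at write time instead of scan time. Here is a sketch of the same query expressed against Elasticsearch rather than a full Spark scan; the index and field names are hypothetical, and it assumes a node on the default port.

```python
# A sketch of string search plus drill-down as one Elasticsearch query
# instead of a full Spark scan. Index and field names are hypothetical.
import requests

ES = "http://localhost:9200"

resp = requests.get(f"{ES}/events/_search", json={
    "query": {"match": {"description": "connection refused"}},  # string search
    "aggs": {  # the drill-down: bucket matching docs by region
        "by_region": {"terms": {"field": "region.keyword"}}
    },
    "size": 5,  # a handful of raw hits; the buckets carry the summary
})

body = resp.json()
print("total matches:", body["hits"]["total"])
for bucket in body["aggregations"]["by_region"]["buckets"]:
    print(bucket["key"], bucket["doc_count"])
```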

Spark is great when you can put a defined data set in memory. Spark is not so great at sucking up the world, in part because in-memory analytics are only as good as your ability to transfer everything to memory and pay for that memory. We still need to think about storage and how to organize it in a way that gets us what we need quickly and cleanly.

With big data, CEOs find garbage in is still garbage out

Another day, another CEO survey. This one, from KPMG, finds that CEOs don’t trust their analytics, the way their team is using or implementing them, or even the data used to make decisions in the first place. In fact, only 31 percent of respondents see their organizations as leaders in the use of data and analytics.

You have to ask: What’s the CEO’s culpability in all that?

Despite the evidence that math makes better decisions than gut calls, many companies haven’t gotten there yet, and the approach many are taking is still the wrong one: Deploying big data infrastructure with no plan and no use cases will go the way of any IT project with no destination in mind.

What’s funny about KPMG’s survey is that despite the lack of trust in both the analytics and the data, a set of very modern concerns emerges: customer loyalty, understanding millennials, projecting the relevance of current products and services, and understanding customer needs and expectations. You know what you need to do to achieve those things? Fix how you collect data and perform analytics. You know who should really push for that and view themselves as leading that charge? The CEO of any decent company (aided and abetted by a CIO, CDO, CTO, and so on).

How to get your mainframe's data for Hadoop analytics

Many so-called big data — really, Hadoop — projects have patterns. Many are merely enterprise integration patterns that have been refactored and rebranded. Of those, the most common is the mainframe pattern.

Because most organizations run the mainframe and its software as a giant single point of failure, the mainframe team hates everyone. Its members hate change, and they don’t want to give you access to anything. However, there is a lot of data on that mainframe, and, if approached gently, the mainframe team would rather see people learn to use the existing system than start from scratch. After all, the company has only begun to scratch the surface of what the mainframe and the existing system have available.

Many great data integration techniques can’t be used in an environment where new software installs are highly discouraged, as is the case with the mainframe pattern. Rest assured, however, that there are plenty of techniques to work around these limitations.

Sometimes the goal of mainframe-to-Hadoop or mainframe-to-Spark projects is just to look at the current state of the world. More frequently, though, the goal is trend analysis and tracking changes in a way the existing system doesn’t, which requires the set of techniques known as change data capture (CDC).
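
CDC comes in several flavors (log-based, trigger-based, snapshot differencing). When you can’t install anything on the mainframe, differencing periodic extracts is often the only option left. Here is a minimal sketch of that idea, with a hypothetical record layout:

```python
# A minimal snapshot-differencing CDC sketch: compare two keyed extracts
# and emit change events. The record layout is hypothetical.
def diff_snapshots(old: dict, new: dict):
    """Yield (op, key, record) change events between two snapshots."""
    for key, record in new.items():
        if key not in old:
            yield ("insert", key, record)
        elif old[key] != record:
            yield ("update", key, record)
    for key in old.keys() - new.keys():
        yield ("delete", key, old[key])

yesterday = {"0001": {"balance": 100}, "0002": {"balance": 250}}
today     = {"0001": {"balance": 120}, "0003": {"balance": 75}}

for op, key, record in diff_snapshots(yesterday, today):
    print(op, key, record)  # update 0001, insert 0003, delete 0002
```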

HDFS: Big data analytics' weakest link

For large-scale analytics, a distributed file system is kind of important. Even if you’re using Spark you need to pull a lot of data into memory very quickly. Having a file system that supports high burst rates — up to network saturation — is a good thing. However, Hadoop’s eponymous file system (Hadoop Distributed File System, aka HDFS) may not be all it’s cracked up to be.

What is a distributed file system? Think of your normal file system, which stores files in blocks. It has some way of noting where on the physical disk a block starts and which file that block belongs to. (One implementation is a file allocation table, or FAT, of sorts.) In a distributed file system, the blocks are “distributed” among disks attached to multiple computers. Additionally, like RAID or most SAN systems, the blocks are replicated so that if a node is lost from the network, no data is lost.
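
As a toy illustration of that mapping, here is a sketch of a “file allocation table” for a distributed file system: a file split into fixed-size blocks, each block replicated across several nodes. This is purely illustrative; it is not how HDFS or any real DFS is implemented.

```python
# A toy model of block placement in a distributed file system: split a
# file into blocks and replicate each block across distinct nodes.
# Purely illustrative; not how HDFS or any real DFS works internally.
import itertools

BLOCK_SIZE = 4   # bytes; absurdly small, for illustration only
REPLICATION = 3
NODES = ["node1", "node2", "node3", "node4", "node5"]

def place_blocks(data: bytes) -> dict:
    """Return a table mapping block index -> (block bytes, replica nodes)."""
    table = {}
    node_cycle = itertools.cycle(NODES)
    for i in range(0, len(data), BLOCK_SIZE):
        block = data[i:i + BLOCK_SIZE]
        replicas = [next(node_cycle) for _ in range(REPLICATION)]
        table[i // BLOCK_SIZE] = (block, replicas)
    return table

for idx, (block, replicas) in place_blocks(b"hello distributed world").items():
    print(idx, block, replicas)
```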

What’s wrong with HDFS?

In HDFS, the role of the “file allocation table” is taken by the namenode. You can have more than one namenode (for redundancy), but essentially the namenode is both a point of failure and a bottleneck. A namenode can fail over, but that takes time; it also means keeping the namenodes in sync, which introduces more latency. On top of that, HDFS does a fair amount of threading and locking, and it is written in garbage-collected Java. Garbage collection — especially Java garbage collection — requires a lot of memory (generally at least 10x to be as efficient as native memory management).

Moreover, in developing applications for distributed computing, we often figure that whatever inefficiency we inject through language choice will be outweighed by I/O: So what if it takes 1,000 operations to open a file and hand over some data, when a single I/O operation takes 10x that long? Simplistically speaking, the higher-level the language, the more operations or “work” are executed per line of code.

5 big data sources for strategic sentiment analysis

Somewhere, someone is tweeting “[This airline] sucks the big one!” In the past, they would have been ignored. These days many airlines respond with sympathy (“We’re so sorry you’re having a rough trip — please DM us, so we can resolve it”) or send an invitation to call an 800-number (where you can wait on hold forever).

A technique called sentiment analysis, the mathematical categorization of statements’ negative or positive connotations, gives companies powerful ways to analyze aggregate language data across all sorts of communications, not only tweets. There’s real value in measuring sentiment inside and outside your company. Here are five of the most valuable sentiment sources to tap.
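
At its simplest, that categorization can be sketched as lexicon scoring: count the positive and negative words in a statement and normalize. Production systems use trained models, but a toy version (with made-up word lists) shows the idea:

```python
# A toy lexicon-based sentiment scorer. Real systems use trained models;
# this just shows the basic idea of scoring a statement's connotation.
POSITIVE = {"great", "love", "helpful", "resolved", "thanks"}
NEGATIVE = {"sucks", "terrible", "broken", "angry", "waiting"}

def sentiment(text: str) -> float:
    """Return a score in [-1, 1]; below zero means negative sentiment."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    return 0.0 if pos + neg == 0 else (pos - neg) / (pos + neg)

print(sentiment("This airline sucks the big one!"))          # -1.0
print(sentiment("Thanks, the agent was great and helpful"))  #  1.0
```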

Customer inquiries

When a customer asks about your product or services, metrics on overall sentiment, the length of the message, and words used can be compared to past inquiries. Different inquiries warrant different treatment.

Customer service

When a customer writes in about a problem, is he or she really upset or simply asking, “Hi, can you look into this?” Sentiment analysis of these interactions helps track the way customers feel about your company or product over time. Is your relationship solid? When interacting with an inexperienced operator, do customers walk away satisfied?

We have the big data tools — let's learn to use them

Recently, at the Apache Spark Maker Community event in San Francisco, I was on a panel and feeling a bit salty. It seems many people have prematurely declared victory in the data game. A few people have achieved self-service, and even more have claimed to.

In truth, this is a tiny minority — and most of those people have achieved cargo-cult datacentricity. They use Hadoop and/or Spark and pull data into Excel, manipulate it, and paste it into PowerPoint. Maybe they’ve added Tableau and are able to make prettier charts, but what really has changed? Jack, that’s what.

Self-service is only step one on this trip to data-driven decision-making. Companies need to know their data before they can consider their choices — but this is still very much data at the edges with a meat cloud in the center.

So far, we use computer-aided decision-making and computer-driven processes only where we have to: advanced fraud detection, algorithmic trading, and rigorously regulated processes (such as Obamacare). Generally, we don’t use them elsewhere.

OK computer: When pop music meets machine learning

It’s Moogfest season here in Durham, so there’s been a lot of discussion in the office around music, data lakes, and the heat map we’re building for the festival. But the conversation took a different turn, thanks to a tweet.

Many months ago when I was at IBM Insight, I tweeted a snide remark about computer-generated jokes. Fast-forward to this week, when former “Monk” and Letterman writer Joe Toplyn responded with a link “proving” that computers could generate jokes that were funny … at least to the easily amused. Amid the discussion, someone drove by playing crappy autotune pop music.

This got me thinking about whether you could generate hit pop songs. Most of the popular songs are written by two middle-aged guys from Sweden anyhow. Plus, there are algorithms that can detect which songs are likely to be a hit. While the current hit-song generator merely pairs song titles with performers, we also have an algorithm that can generate tweets for the presumptive Republican presidential nominee. It seems like a short trip from hit detector to factory songwriting to neural net for political speech to full-on pop song generator!

We’d need parameters like a genre (pop, hip-hop, dance) and probably gender, as well as whether it’s a party track, a love song, happy, sad, angry, and so on. Then maybe we’d train a neural net on the corpus of songs by the two Swedes. Add that to an adaptation of the hit detection algorithm and you should have not a great song, but at the very least a popular one.