Bossies 2016: The Best of Open Source Software Awards

Does anyone even try to sell closed-source software anymore? It must be hard, when so many of the tools used to power the world’s largest datacenters and build the likes of Google, Facebook, and LinkedIn have been planted on GitHub for everyone to use. Even Google’s magic sauce, the software that knows what you will read or buy before you read or buy it, is now freely available to any ambitious developer with dreams of a smarter application.

Google never used to share its source code with the rest of us. It used to share research papers, then leave it to others to come up with the code. Perhaps Google regrets letting Yahoo steal its thunder with Hadoop. Whatever the reason, Google is clearly in the thick of open source now, having launched its own projects — TensorFlow and Kubernetes — that are taking the world by storm.

Of course, TensorFlow is the machine learning magic sauce noted above, and Kubernetes the orchestration tool that is fast becoming the leading choice for managing containerized applications. You can read all about TensorFlow and Kubernetes, along with dozens of other excellent open source projects, in this year’s Best of Open Source Awards, aka the Bossies. In all, our 2016 Bossies cover 72 winners in five categories:

The software tumbling out of Google and other cloudy skies marks a huge shift in the open source landscape and an even bigger shift in the nature of the tools that businesses use to build and run their applications. Just as Hadoop reinvented data analytics by distributing the work across a cluster of machines, projects such as Docker and Kubernetes (and Mesos and Consul and Habitat and CoreOS) are reinventing the application “stack” and bringing the power and efficiencies of distributed computing to the rest of the datacenter.

This new world of containers, microservices, and distributed systems brings plenty of challenges too. How do you handle monitoring, logging, networking, and security in an environment with thousands of moving parts, where services come and go? Naturally, many open source projects are already working to answer these questions. You’ll find a number of them among our Bossie winners.

We’ve come to expect new names in the Bossies, but this year’s winners may include more newcomers than ever. Even in the arena of business applications, where you find many of the older codebases and established vendors, we see pockets of reinvention and innovation. New machine learning libraries and frameworks are taking their place among the best open source development and big data tools. New security projects are taking a cloud-inspired devops approach to exposing weaknesses in security controls.

Open source software projects continue to fuel an amazing boom in enterprise technology development. If you want to know what our applications, datacenters, and clouds will look like in the years to come, check out the winners of InfoWorld’s Best of Open Source Awards.

Source: InfoWorld Big Data

SAP woos SMB developers with an 'express' edition of Hana

SAP has made no secret of the fact that its bets for the future rest largely on its Hana in-memory computing platform. But broad adoption is a critical part of making those bets pay off.

Aiming to make Hana more accessible to companies of all shapes and sizes, the enterprise software giant on Monday unveiled a downloadable “express” edition that developers can use for free.

The new express edition of SAP Hana can be used free of charge on a laptop or PC to develop, test, and deploy production applications that use up to 32GB of memory; users who need more memory can upgrade for a fee. Either way, the software delivers database, application, and advanced analytics services, allowing developers to build applications that use Hana’s transactional and analytical processing against a single copy of data, whether structured or unstructured.

Originally launched more than five years ago, Hana uses an in-memory computing engine in which data to be processed is held in RAM instead of being read from disks or flash storage. This makes for faster performance. Hana was recently updated with expanded analytics capabilities and tougher security, among other features.

Hana also forms the basis for S/4Hana, the enterprise suite that SAP released in early 2015.

The new express edition of Hana can be downloaded from the SAP developer center and installed on commodity servers, desktops, and laptops using a binary installation package with support for either SUSE Linux Enterprise Server or Red Hat Enterprise Linux. Alternatively, it can be installed on Windows or Mac OS by downloading a virtual machine installation image that is distributed with SUSE Linux Enterprise Server.

Tutorials, videos, and community support are available. The software can also be obtained through the SAP Cloud Appliance Library, which provides deployment options for popular public cloud platforms.

“The new easy-to-consume model via the cloud or PC and free entry point make a very attractive offering from SAP,” said Cindy Jutras, president of research firm Mint Jutras. “Now companies such as small-to-midsize enterprises have access to a data management and app development platform that has traditionally been used by large enterprises.”

Source: InfoWorld Big Data

Salesforce is betting its Einstein AI will make CRM better

If there was any doubt that AI has officially arrived in the world of enterprise software, Salesforce just put it to rest. The CRM giant on Sunday announced Einstein, a set of artificial intelligence capabilities it says will help users of its platform serve their customers better.

AI’s potential to augment human capabilities has already been proven in multiple areas, but tapping it for a specific business purpose isn’t always straightforward. “AI is out of reach for the vast majority of companies because it’s really hard,” John Ball, general manager for Salesforce Einstein, said in a press conference last week.

With Einstein, Salesforce aims to change all that. Billing the technology as “AI for everyone,” the company is putting Einstein’s capabilities into all its clouds, bringing machine learning, deep learning, predictive analytics, and natural language processing into each piece of its CRM platform.

In Salesforce’s Sales Cloud, for instance, machine learning will power predictive lead scoring, a new tool that can analyze all data related to leads — including standard and custom fields, activity data from sales reps, and behavioral activity from prospects — to generate a predictive score for each lead. The models will continuously improve over time by learning from signals like lead source, industry, job title, web clicks, and emails, Salesforce said. 

Another tool will analyze CRM data combined with customer interactions such as inbound emails from prospects to identify buying signals earlier in the sales process and recommend next steps to increase the sales rep’s ability to close a deal.

In Service Cloud, Einstein will power a tool that aims to improve productivity by pushing a prioritized list of response suggestions to service agents based on case context, case history, and previous communications.

Salesforce’s Marketing, Commerce, Community, Analytics, IoT and App Clouds will benefit similarly from Einstein, which leverages all data within Salesforce — including activity data from its Chatter social network, email, calendar, and ecommerce as well as social data streams and even IoT signals — to train its machine learning models.

The technology draws on recent Salesforce acquisitions including MetaMind. Roughly 175 data scientists have helped build it, Ball said.

Every vendor is now facing the challenge of coming up with a viable AI product, said Denis Pombriant, managing principal at Beagle Research Group.

“Good AI has to make insight and knowledge easy to grasp and manipulate,” Pombriant said. “By embedding products like Einstein into customer-facing applications, we can enhance the performance of regular people and enable them to do wonderful things for customers. It’s not about automation killing jobs; it’s about automation making new jobs possible.”

Most of Salesforce’s direct competitors, including Oracle, Microsoft, and SAP, have AI programs of their own, some of them dating back further than Salesforce’s, Pombriant noted.

Indeed, predictive analytics has been an increasingly significant part of the marketer’s toolbox for some time, and vendors including Pegasystems have been applying such capabilities to CRM.

“I think more than any other move, such as IoT, AI is the next big thing we need to focus on,” Pombriant said. “If IoT is going to be successful, it will need a lot of good AI to make it all work.”

New Einstein features will start to become available next month as part of Salesforce’s Winter ’17 release. Many will be added into existing licenses and editions; others will require an additional charge.

Also on Sunday, Salesforce announced a new research group focused on delivering deep learning, natural language processing, and computer vision to Salesforce’s product and engineering teams.

Source: InfoWorld Big Data

ClearDB Joins Google Cloud Platform Technology Partner Program

ClearDB has announced it has been named a Google Cloud Platform Technology Partner, providing a fully-managed database service that improves MySQL database performance, availability, and ease of control.

The collaboration between Google Cloud Platform and ClearDB enables organizations to benefit from accelerated application development via rapid deployment of database assets, highly available MySQL that avoids application disruptions, and a pay-as-you-go model that eliminates the need for organizations to procure and maintain costly infrastructure.

To help customers get the most out of Google Cloud Platform services, Google works closely with companies such as ClearDB that deliver best-in-class, fully managed database services on top of Google Cloud Platform.

“The combination of ClearDB and Google Cloud Platform can free users from the overhead of managing infrastructure, provisioning servers and configuring networks,” said Jason Stamper, an analyst with 451 Research.  “ClearDB on Google Cloud Platform can allow users to focus on business innovation and growth.”

ClearDB is cloud agnostic and is the only vendor offering MySQL database-as-a-service (DBaaS) on three cloud providers. The company offers MySQL users a quick and efficient means to rapidly deploy highly available database assets in the cloud. Because ClearDB’s DBaaS is built on top of native MySQL, no code changes are required, simplifying deployment while ensuring high availability with sub-second automatic failover via high-availability routers.

“With database assets playing a vital role in achieving business success in today’s always-on, data-driven economy, the ability to accelerate database-powered application development, ensure ‘always-on’ availability and provide fully-managed database services is essential,” said Allen Holmes, ClearDB vice president of marketing and platform alliances. “ClearDB is committed to expanding its DBaaS offering to all major cloud providers and we are excited to add Google Cloud to our existing lineup of Microsoft Azure and Amazon EC2 offerings.”

Designed to work on major public clouds and to support private cloud and on-premises operations, ClearDB’s nonstop Data Services Platform extends ClearDB’s MySQL DBaaS offering and automates the provisioning and management process with an intuitive services framework. The platform aims to accelerate performance and guarantee high availability in any cloud marketplace, including Microsoft Azure, Amazon Web Services (AWS), Heroku, AppFog, SoftLayer, and IBM Bluemix, all while reducing database license footprint and related infrastructure costs.

Source: CloudStrategyMag

How MIT's C/C++ extension breaks parallel processing bottlenecks

Earlier this week, MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) announced Milk, a system that speeds up parallel processing of big data sets by as much as three or four times.

If you think this involves learning a whole new programming language, breathe easy. Milk is less a radical departure from existing software development than a refinement of an existing set of C/C++ tools.

All together now

According to the paper authored by the CSAIL team, Milk is a C/C++ language family extension that addresses the memory bottlenecks plaguing big data applications. Apps that run in parallel contend with each other for memory access, so any gains from parallel processing are offset by the time spent waiting for memory.

Milk solves these problems by extending an existing library, OpenMP, widely used in C/C++ programming for parallelizing access to shared memory. Programmers typically use OpenMP by annotating sections of their code with directives (“pragmas”) to the compiler to use OpenMP extensions, and Milk works the same way. The directives are syntactically similar, and in some cases, they’re minor variants of the existing OpenMP pragmas, so existing OpenMP apps don’t have to be heavily reworked to be sped up.
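
To make the annotation model concrete, here is a minimal sketch in C++ of the kind of scattered-access loop Milk targets, using a standard OpenMP pragma. Note that the "milk" clause shown in the comment is purely hypothetical (the paper's exact directive names aren't reproduced here), so treat this as an illustration of the pragma style rather than working Milk code.

    // Minimal OpenMP sketch of a scattered-access loop of the kind Milk targets.
    // Build: g++ -O2 -fopenmp gather.cpp -o gather
    #include <cstdio>
    #include <random>
    #include <vector>

    int main() {
        const std::size_t N = 1 << 24;   // large table (stand-in for a big data set)
        const std::size_t M = 1 << 20;   // sparse set of indices actually visited
        std::vector<double> table(N, 1.0);
        std::vector<std::size_t> idx(M);

        std::mt19937_64 rng(42);
        std::uniform_int_distribution<std::size_t> pick(0, N - 1);
        for (auto &i : idx) i = pick(rng);   // scattered, cache-unfriendly accesses

        double sum = 0.0;
        // Standard OpenMP: threads split the loop and gather scattered elements.
        // A Milk-style variant would reportedly look much the same, adding a
        // clause (hypothetical name below) telling the compiler to batch the
        // indirect accesses to table[idx[k]]:
        //   #pragma omp parallel for reduction(+:sum) milk
        #pragma omp parallel for reduction(+:sum)
        for (std::size_t k = 0; k < M; ++k)
            sum += table[idx[k]];

        std::printf("sum = %.1f\n", sum);
        return 0;
    }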

Milk’s big advantage is that it performs what the paper’s authors describe as “DRAM-conscious clustering.” Since data shuttled from memory is cached locally on the CPU, batching together data requests from multiple processes allows the on-CPU cache to be shared more evenly between them.

The most advanced use of Milk requires using some functions exposed by the library — in other words, some rewriting — but it’s clearly possible to get some results right away by simply decorating existing code.

Let’s not throw all this out yet

As CPU speeds top out, attention has turned to other methods to ramp up processing power. The most direct option is to scale out: spreading workloads across multiple cores on a single chip, across multiple CPUs, or throughout a cluster of machines. While a plethora of tools exist to spread out workloads in these ways, the languages used for them don’t take parallelism into account as part of their designs. Hence the creation of newer languages like Pony to provide a fresh set of metaphors for how to program in such environments.

Another approach has been to work around the memory-to-CPU bottleneck by moving more of the processing to where the data already resides. Example: the MapD database, which uses GPUs and their local memory for both accelerated processing and distributed data caching.

Each of these approaches has its downsides. With new languages, there’s the pain of scrapping existing workflows and toolchains, some of which have decades of work behind them. Using GPUs has some of the same problems: Shifting workloads to a GPU is easy only if the existing work is abstracted away through a toolkit that can be made GPU-aware. Otherwise, you’re back to rewriting everything from scratch.

A project like Milk, on the other hand, adds a substantial improvement to a tool set that’s already widely used and well understood. It’s always easier to transform existing work than to tear it down and start over, so Milk provides a way to squeeze more out of what we already have.

Source: InfoWorld Big Data

Global Data Center Construction Market Expected To Grow Through 2020

The global data center construction market is expected to grow at a CAGR of more than 12% during the period 2016-2020, according to Technavio’s latest report.

In this report, Technavio covers the market outlook and growth prospects of the global data center construction market for 2016-2020. The report also presents the vendor landscape and a corresponding detailed analysis of the major vendors operating in the market. It includes vendors across all geographical regions. The report provides the performance and market dominance of each of the vendors in terms of experience, product portfolio, geographical presence, financial condition, R&D, and customer base.

“The data center construction market is growing significantly with major contributions by cloud service providers (CSPs) and telecommunication and colocation service providers worldwide. The increased construction is facilitated by the increased demand for cloud-based service offerings and big data analytics, driven by the stronger growth of data through connected devices in the form of the IoT,” says Rakesh Kumar Panda, a lead data center research expert from Technavio.

Technavio’s ICT research analysts segment the global data center construction market into the following regions:

  • Americas
  • EMEA
  • APAC

In 2015, the Americas dominated the global data center construction market with a market share of close to 45%, followed by EMEA at around 34% and APAC with a little over 21%.

Source: CloudStrategyMag

Data Foundry Expands Managed Services Offering With Cloud Services

Data Foundry has announced the addition of Cloud Services to its portfolio of managed services. The company has always prided itself on being a strategic IT partner that offers more than just colocation space, and it is now providing Dedicated Cloud Storage and CloudTap, a private and secure cloud connection service.

In addition to the new Cloud Services, Data Foundry currently offers a suite of virtual and physical security services, network services, structured cabling and infrastructure installation.

“We continue to aggressively expand our managed services portfolio, and we are excited to provide our customers with more options when it comes to storage and access to cloud services,” says Mark Noonan, executive vice president of sales. “These new services complement our core services and enable our customers to better manage their overall IT strategy.”

Data Foundry’s Dedicated Cloud Storage is an enterprise storage-as-a-service solution with high availability features. It is a private storage solution that exists on virtual storage arrays and consists of dedicated cores and disks. Workloads in each array are completely isolated from one another, and users own their encryption keys. Users can also choose from SSD, SATA, or SAS storage, or a combination of these. This provides companies with greater flexibility and reduced capital expenditure, as they would normally have to purchase these storage resources individually. Storage arrays are located in Data Foundry’s Texas 1 data center in Austin, TX, and customers can access their arrays via private transport, making it a fast and highly secure option for storage.

Data Foundry’s other new cloud service, CloudTap, allows colocation customers to access cloud storage and cloud services from major providers, such as Azure, AWS, and Google Cloud, without traversing the public Internet. The company’s network engineers have designed a solution that enables protected connectivity to all the major cloud providers.

Source: CloudStrategyMag

AWS Direct Connect Service now Available in Equinix Los Angeles

Equinix, Inc. has announced the immediate availability of the Amazon Web Services (AWS) Direct Connect cloud service in Equinix’s Los Angeles (IBX®) data centers. With AWS Direct Connect, companies can connect their own managed infrastructure directly to AWS, establishing a private connection to the cloud that can reduce costs, increase performance, and deliver a more consistent network experience. The Los Angeles location brings the total number of Equinix metros offering AWS Direct Connect to 12 globally, five of which are in North America.

“As one of the early data center partners to offer AWS Direct Connect services, our goal has always been to provide our customers with the ability to realize the full benefits of the cloud — without worrying about application latency or cost issues. By offering access to AWS via the Direct Connect service in Los Angeles, we are providing additional ways for our North American customers to achieve improved performance of their cloud-based applications,” said Greg Adgate, vice president, Equinix.

Cloud adoption continues to rise among both startups and enterprises. In fact, recent survey results from 451 Research’s “Voice of the Enterprise: Cloud” program found that 52% of 440 enterprises surveyed indicated that their public cloud spending would increase in the immediate future. By providing direct access to the AWS cloud inside Equinix data centers, Equinix is enabling enterprise CIOs to advance their cloud strategies by seamlessly and safely incorporating public cloud services into their existing architectures.

“Our quarterly survey of cloud adoption and spending shows steadily increasing growth in both enterprise usage and investment in cloud services. Equinix is fostering this trend by enabling direct, low-latency, secure connections to cloud services, like AWS Direct Connect, within its multi-tenant facilities,” said Andrew Reichman, director, Voice of the Enterprise: Cloud.

The Equinix Los Angeles campus includes four Equinix IBX data centers, which are connected via Metro Connect. While AWS Direct Connect service will reside in the LA3 facility, customers can connect to AWS Direct Connect from any one of these IBX data centers through Metro Connect.  Equinix’s Los Angeles data centers are business hubs for more than 250 companies, and offer interconnections to network services from more than 80 service providers.

Equinix Los Angeles data centers are central to the network strategies of digital content and entertainment companies looking to reach their end users quickly. These companies can now leverage the benefits of the AWS cloud to create, deliver, and measure compelling content and customer experiences in a highly scalable, elastic, secure, and cost-effective manner using AWS Direct Connect.

With the addition of Los Angeles, Equinix now offers the AWS Direct Connect service in Amsterdam, Dallas, Frankfurt, London, Los Angeles, Osaka, Seattle, Silicon Valley, Singapore, Sydney, Tokyo and Washington, D.C./Northern Virginia. Equinix customers in these metros will be able to lower network costs into and out of AWS and take advantage of reduced AWS Direct Connect data transfer rates.

Source: CloudStrategyMag

New programming language promises a 4X speed boost on big data

Memory management can be challenging enough with traditional data sets, but when big data enters the picture, things can slow way, way down. A new programming language announced by MIT this week aims to remedy that problem, and so far it’s been found to deliver fourfold speed boosts on common algorithms.

The principle of locality is what governs memory management in most computer chips today, meaning that if a program needs a chunk of data stored at some memory location, it’s generally assumed to need the neighboring chunks as well. In big data, however, that’s not always the case. Instead, programs often must act on just a few data items scattered across huge data sets.

Fetching data from main memory is the major performance bottleneck in today’s chips, so having to fetch it more frequently can slow execution considerably.

“It’s as if, every time you want a spoonful of cereal, you open the fridge, open the milk carton, pour a spoonful of milk, close the carton, and put it back in the fridge,” explained Vladimir Kiriansky, a doctoral student in electrical engineering and computer science at MIT.

With that challenge in mind, Kiriansky and other researchers from MIT’s Computer Science and Artificial Intelligence Laboratory (CSAIL) have created Milk, a new language that lets application developers manage memory more efficiently in programs that deal with scattered data points in large data sets.

Essentially, Milk adds a few commands to OpenMP, an API for languages such as C and Fortran that makes it easier to write code for multicore processors. Using it, the programmer inserts a few additional lines of code around any instruction that iterates through a large data collection looking for a comparatively small number of items. Milk’s compiler then figures out how to manage memory accordingly.

With a program written in Milk, when a core discovers that it needs a piece of data, it doesn’t request it — and the attendant adjacent data — from main memory. Instead, it adds the data item’s address to a list of locally stored addresses. When the list gets long enough, all the chip’s cores pool their lists, group together those addresses that are near each other, and redistribute them to the cores. That way, each core requests only data items that it knows it needs and that can be retrieved efficiently.
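
To make the mechanism concrete, here is a small illustrative sketch in plain C++ of the defer-then-cluster idea: collect the scattered addresses first, sort them so neighboring locations end up adjacent, then execute the accesses in memory order. This is only an analogy for the concept; it is not Milk's compiler-generated code or runtime.

    // Conceptual illustration of deferred, clustered memory access (not Milk's
    // actual runtime): indices are collected first, sorted so nearby addresses
    // end up adjacent, then dereferenced in memory order.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    double clustered_gather(const std::vector<double> &table,
                            std::vector<std::size_t> indices) {
        // Cluster: sorting groups indices that fall on neighboring cache lines,
        // so each line is pulled from memory roughly once.
        std::sort(indices.begin(), indices.end());

        // Execute the deferred accesses in memory order.
        double sum = 0.0;
        for (std::size_t i : indices) sum += table[i];
        return sum;
    }

    int main() {
        std::vector<double> table(1 << 20, 2.0);
        std::vector<std::size_t> scattered = {1048575, 3, 524288, 4, 2, 1};
        std::printf("%.1f\n", clustered_gather(table, scattered));  // prints 12.0
        return 0;
    }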

In tests on several common algorithms, programs written in the new language were four times as fast as those written in existing languages, MIT says. That could get even better, too, as the researchers work to improve the technology further. They’re presenting a paper on the project this week at the International Conference on Parallel Architectures and Compilation Techniques.

Source: InfoWorld Big Data

Beyond Solr: Scale search across years of event data

Every company wants to guarantee uptime and positive experiences for its customers. Behind the scenes, in increasingly complex IT environments, this means giving operations teams greater visibility into their systems — stretching the window of insight from hours or days to months and even multiple years. After all, how can IT leaders drive effective operations today if they don’t have the full-scale visibility needed to align IT metrics with business results?

Expanding the window of visibility has clear benefits in terms of identifying emerging problems anywhere in the environment, minimizing security risks, and surfacing opportunities for innovation. Yet it also has costs. From an IT operations standpoint, time is data: The further you want to see, the more data you have to collect and analyze. It is an enormous challenge to build a system that can ingest many terabytes of event data per day while maintaining years of data, all indexed and ready for search and analysis.

These extreme scale requirements, combined with the time-oriented nature of event data, led us at Rocana to build an indexing and search system that supports ever-growing mountains of operations data — for which general-purpose search engines are ill-suited. As a result, Rocana Search has proven to significantly outperform solutions such as Apache Solr in data ingestion. We achieved this without restricting the volume of online and searchable data, with a solution that stays responsive and scales horizontally via dynamic partitioning.

The need for a new approach

When your mission is to enable petabyte-level visibility across years of operational data, you face three primary scalability challenges:

  • Data ingestion performance: As the number of data sources monitored by IT operations teams grows, can the system continue to pull in data, index it immediately, store it indefinitely, and categorize it for faceting?
  • Volume of searchable data that can be maintained: Can the system keep all data indexed as the volume approaches petabyte scale, without pruning data at the cost of losing historical analysis?
  • Query speed: Can the index perform more complex queries without killing performance?

The major open source contenders in this search space are Apache Solr and Elasticsearch, which both use Lucene under the covers. We initially looked very closely at these products as potential foundations on which to build the first version of Rocana Ops. While Elasticsearch has many features that are relevant to our needs, potential data loss has significant implications for our use cases, so we decided to build the first version of Rocana Ops on Solr.

Solr’s scaling method is to shard the index, which splits the various Lucene indexes into a fixed number of separate chunks. Solr then spreads them out across a cluster of machines, providing parallel and faster ingest performance. At lower data rates and short data retention periods, Solr’s sharding model works. We successfully demonstrated this in production environments with limited data retention requirements. But the Lucene indexes still grow larger over time, presenting persistent scalability challenges and prompting us to rethink the best approach to search in this context.

Compare partitioning models

Like Elasticsearch and Solr, Rocana Search is a distributed search engine built on top of Lucene. The Rocana Search sharding model is significantly different from that of Elasticsearch and Solr. It creates new Lucene indexes dynamically over time, enabling customers to retain years of indexed data on disk and have it immediately accessible for query, while keeping each Lucene index small enough to maintain low-latency query times.

Why didn’t the Solr and Elasticsearch sharding models work for us? Both solutions have a fixed sharding model, where you specify the number of shards at the time the collection is created.

With Elasticsearch, changing the number of shards requires you to create a new collection and re-index all of your data. With Solr, there are two ways to grow the number of shards for a pre-existing collection: splitting shards and adding new shards. Which method you use depends on how you route documents to shards. Solr has two routing methods, compositeId (default) and implicit. With either method, large enterprise production environments will eventually hit practical limits for the number of shards in a single index. In our experience, that limit is somewhere between 600 and 1,000 shards per Solr collection.
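
For readers unfamiliar with fixed sharding, the idea boils down to hashing a document identifier into one of N buckets, where N is fixed when the collection is created. The short C++ sketch below is only an analogy for how a compositeId-style router behaves (it is not Solr's implementation); it shows why growing the shard count later forces existing documents to be re-routed.

    // Analogy for fixed hash-based document routing (not Solr's actual code):
    // the shard count is chosen at collection-creation time, so every document
    // ID hashes into one of N buckets, and changing N re-routes existing data.
    #include <cstdio>
    #include <functional>
    #include <string>

    std::size_t shard_for(const std::string &doc_id, std::size_t num_shards) {
        return std::hash<std::string>{}(doc_id) % num_shards;   // N is fixed
    }

    int main() {
        const std::size_t num_shards = 8;   // chosen up front, hard to change
        for (const std::string id : {"evt-1", "evt-2", "evt-3"})
            std::printf("%s -> shard %zu of %zu\n", id.c_str(),
                        shard_for(id, num_shards), num_shards);
        return 0;
    }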

Before the development of Rocana Search, Rocana Ops used Solr with implicit document routing. While this made it difficult to add shards to an existing index, it allowed us to build a time-based semantic partitioning layer on top of Solr shards, giving us additional scalability on query, as we didn’t have to route every query to all shards.

In production environments, our customers are ingesting billions of events per day, so ingest performance matters. Unfortunately, fixed shard models and very large daily data volumes do not mix well. Eventually you will have too much data in each shard, causing ingest and query to slow dramatically. You’re then left choosing between two bad options:

  1. Create more shards and re-index all data into them (as described above).
  2. Periodically prune data out of the existing shards, which requires deleting data permanently or putting it into “cold” storage, where it is no longer readily accessible for search.

Unfortunately, neither option suited our needs.

The advantages of dynamic sharding

Data coming into Rocana Ops is time-based, which allowed us to create a dynamic sharding model for Rocana Search. In the simplest terms, you can specify that a new shard be created every day on each cluster node: at 100 nodes, that’s 100 new shards every day. If your time partitions are configured appropriately, the dynamic sharding model allows the system to scale over time to retain as much data as you want to keep, while still achieving high rates of data ingest and ultra-low-latency queries. What allows us to utilize this strategy is a two-part sharding model:

  1. We create new shards over time (typically every day), which we call partitions.
  2. We slice each of those daily partitions into smaller pieces, and these slices correspond to actual Lucene directories.

Each node on the cluster will add data to a small number of slices, dividing the work of processing all the messages for a given day across an arbitrary number of nodes as shown in Figure 1.

Figure 1: Partitions and slices on Rocana Search servers. In this small example, two Rocana Search servers, with two slices (S) per node, have data spanning four time partitions. The number of partitions will grow dynamically as new data comes in.

Each event coming to Rocana Ops has a timestamp. For example, if the data comes from a syslog stream, we use the timestamp on the syslog record, and we route each event to the appropriate time partition based on that timestamp. All queries in Rocana Search are required to define a time range — any given window of time where an item of interest happened. When a query arrives, it will be parsed to determine which of the time partitions on the Rocana Search system are in scope. Rocana Search will then only search that subset.
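
A simplified sketch of that routing logic appears below, assuming day-granularity partitions keyed by days since the Unix epoch. The granularity, key scheme, and data structures here are illustrative assumptions, not Rocana's internals; the point is simply that ingest routes each event by its own timestamp and a query's time range selects only the overlapping partitions.

    // Illustrative sketch of time-based partition routing (not Rocana's code):
    // events land in a per-day partition keyed by their own timestamp, and a
    // query's [from, to] range selects only the partitions that overlap it.
    #include <cstdint>
    #include <cstdio>
    #include <map>
    #include <string>
    #include <vector>

    constexpr int64_t kMillisPerDay = 24LL * 60 * 60 * 1000;

    // Partition key: days since the Unix epoch.
    int64_t partition_for(int64_t event_ts_ms) { return event_ts_ms / kMillisPerDay; }

    struct Event { int64_t ts_ms; std::string body; };

    int main() {
        std::map<int64_t, std::vector<Event>> partitions;   // day -> events

        // Ingest: route each event by the timestamp carried on the event itself.
        for (const Event &e : {Event{0, "a"},
                               Event{kMillisPerDay + 5, "b"},
                               Event{3 * kMillisPerDay, "c"}})
            partitions[partition_for(e.ts_ms)].push_back(e);

        // Query: the required time range narrows the search to overlapping days.
        const int64_t from_ms = kMillisPerDay;
        const int64_t to_ms = 2 * kMillisPerDay;
        for (int64_t day = partition_for(from_ms); day <= partition_for(to_ms); ++day) {
            auto it = partitions.find(day);
            if (it != partitions.end())
                std::printf("searching partition %lld (%zu events)\n",
                            (long long)day, it->second.size());
        }
        return 0;
    }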

Ingest performance

The difference in ingestion performance between Solr and Rocana Search is striking. In controlled tests with a small cluster, Rocana Search’s initial ingest performance proved significantly better than Solr’s — as much as two times faster — and the gap grows significantly over time as the systems ingest more data. At the end of these tests, Rocana Search performs in the range of five to six times faster than Solr.

Figure 2: Comparing data ingestion speed of Rocana Search versus Solr over a 48-hour period on the same four-DataNode Hadoop (HDFS) cluster. Rocana Search is able to ingest more than 12.5 billion events, versus 2.4 billion for Solr.

Event size and cardinality can significantly impact ingestion speed for both Solr and Rocana Search. Our tests include both fixed- and variable-sized data, and the results follow our predicted pattern: Rocana Search’s ingestion rate remains relatively steady while Solr’s decreases over time, mirroring what we’ve seen in much larger production environments.

Query performance

Rocana Search’s query performance is competitive with Solr and can outperform Solr while data ingestion is taking place. In querying for data with varying time windows (six hours, one day, three days), we see Solr returning queries quickly for the fastest 50 percent of the queries. Beyond this, Solr query latency starts increasing dramatically, likely due to frequent multisecond periods of unresponsiveness during data ingest.

Figure 3: Comparing query latency of Rocana Search versus Solr. Query is for time ranges of six hours, one day, and three days, on a 4.2TB dataset on a four-DataNode Hadoop (HDFS) cluster.

Rocana Search’s behavior under ingest load is markedly different from that of Solr. Rocana Search’s query times are much more consistent, well into the 90th percentile of query times. Above the 50th percentile, Rocana Search’s query times edge out Solr’s across multiple query range sizes. There are several areas where we anticipate being able to extract additional query performance from Rocana Search as we iterate on the solution, which our customers are already using in production.

A solution for petabyte-scale visibility

Effectively managing today’s complex and distributed business environments requires deep visibility into the applications, systems, and networks that support them. Dissatisfaction with standard approaches led us to develop a unique solution that has already been put into production and been proven to work.

By leveraging the time-ordered nature of operations data and a dynamic sharding model built on Lucene, Rocana Search keeps index sizes reasonable, supports high-speed ingest, and maintains performance by restricting time-oriented searches to a subset of the full data set. As a result, Rocana Search is able to scale indexing and searching in a way that other potential solutions can’t match.

As a group of services coordinated across multiple Hadoop DataNodes, Rocana Search creates shards (partitions and slices) on the fly, without manual intervention, server restarts, or the need to re-index already processed data. Ownership of these shards can be automatically transferred to other Rocana Search nodes when nodes are added or removed from the cluster, requiring no manual intervention.

IT operations data has value. The amount of that data you keep should be dictated by business requirements, not the limitations of your search solution. When enterprises face fewer barriers to scaling data ingest and search, they can focus on searching and analyzing as much of their IT operations event data as they wish, for as long as they choose, rather than worrying about what data to collect, what to keep, how long to store it, and how to access it in the future.

Brad Cupit and Michael Peterson are platform engineers at Rocana.

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data