Zimperium® Deploys ZeroStack’s Private Cloud Solution

ZeroStack, Inc. has announced that Zimperium has deployed the ZeroStack Intelligent Cloud Platform to speed, streamline, and reduce the cost of its software development.

“We are continually enhancing our software-defined mobile threat defense products, and we need to empower our developers with self-service, cloud-based tools,” said Jerome Brock, senior DevOps Engineer at Zimperium. “By integrating ZeroStack’s Intelligent Cloud Platform onto our bare-metal servers, we have created a self-service DevOps environment that is cost-effective and easy to maintain.”

“Zimperium is the leader in mobile threat defense, and their position depends on their ability to continually enhance their software,” said Kamesh Pemmaraju, vice president of Product Management at ZeroStack. “The ZeroStack Intelligent Cloud Platform helps them empower their developers while retaining full control over their cloud resources.”

Source: CloudStrategyMag

WordPress Issues Emergency Patch for SQL Injection Vulnerability

WordPress announced the security release of version 4.8.3 this week to patch a vulnerability to website takeover through an SQL injection attack.

The Halloween fright, CVE-2017-14723, was discovered and reported to the bug bounty program in September by researcher Anthony Ferrara.

While WordPress core is not affected, according to the new release announcement, the new version hardens it to protect it from attacks via plugins and themes. In version 4.8.2 and earlier, “$wpdb->prepare() can create unexpected and unsafe queries,” allowing potential SQL injection. The new release changes the behavior of the esc_sql() function, which WordPress says will not affect most developers.
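
The class of bug, roughly, is that a string which has already been through prepare() can be run through it again, so a placeholder token smuggled inside a quoted value goes live on the second pass. Below is a minimal, hypothetical Python sketch of that vulnerability class (not WordPress’s actual PHP code); it illustrates why printf-style preparation is unsafe when the query template itself can contain user input:

import re

def naive_prepare(template, *args):
    # Sketch of a printf-style prepare(): quote each argument, then
    # substitute it for the next %s placeholder in the template.
    remaining = iter(args)
    def fill(_match):
        value = str(next(remaining, ""))
        return "'" + value.replace("'", "''") + "'"
    return re.sub(r"%s", fill, template)

# First pass: an attacker-supplied meta key is quoted, but a literal
# %s inside it survives into the output string.
clause = naive_prepare("meta_key = %s", "x%s y")
print(clause)                      # meta_key = 'x%s y'

# Second pass: the already-prepared clause is prepared again (the
# unsafe pattern). The smuggled %s now consumes the id argument and
# misaligns every quote that follows.
print(naive_prepare(clause + " AND post_id = %s", 42))
#   meta_key = 'x'42' y' AND post_id = ''
# With carefully crafted input, the text that escapes the string
# literal becomes attacker-controlled SQL.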

The vulnerability traces back to version 4.8.1, but Ferrara says the fix WordPress released with version 4.8.2 dealt with only “a narrow subset of the potential exploits.” According to Ferrara, 4.8.2 not only failed to solve the problem but also broke many sites and over a million lines of third-party code. He reported the bug the day after the release of 4.8.2, but WordPress closed his report on the grounds that “non documented functionality is non documented.”

Several messages back and forth followed before Ferrara threatened, on Oct. 16, to publicly report the vulnerability on the 19th. WordPress convinced Ferrara to hold off, but he threatened again on Oct. 20 to take the issue public on the 25th. Ferrara writes in his report of the disclosure process that the WP security team told him, “[o]ne of our struggles here, as it often is in security, is how to secure things while also breaking as little as possible.”

On the 27th, it seems another member of the WordPress team became involved, and Ferrara finally received the responses he was looking for. He acknowledged in his account of the incident the challenges facing the volunteer team dealing with the issue.

“The miss IMHO isn’t that a team of volunteers isn’t living up to my expectations, but that a platform that powers 25%+ of the Internet (or at least CMS-powered-Internet) isn’t staffed with full time security personnel,” he wrote. “Volunteers are amazing and can only do so much. At some point it comes down to the companies making money off of it and not staffing it that are ultimately the biggest problems…”

WordPress, for its part, thanked Ferrara for practicing responsible disclosure.

Source: TheWHIR

Facebook's WhatsApp Platform Suffers Connectivity Issues

(Bloomberg) — The WhatsApp chat app is suffering a global outage on Friday, with users from the U.K. to Indonesia reporting connectivity issues.

Downdetector, a website that tracks outages, reported that WhatsApp has been having issues since 2:38 a.m. New York time.

WhatsApp did not respond to a request for comment. A notice on the app said, “Our service is experiencing a problem right now. We are working on it and hope to restore functionality shortly.” The chat app, owned by Facebook Inc., has more than 1 billion users.

In August Facebook went down temporarily for some of its more than 2 billion global users after a technical error caused a glitch that blocked access to the social network. Outages are rare for Facebook, which via its social media platform and WhatsApp has become a digital front page for people to read news, share information and communicate with friends and family.

The past 24 hours have been a tricky time for social media platforms. U.S. President Donald Trump’s personal Twitter account went down abruptly for about 11 minutes Thursday evening, a brief deactivation the social media company blamed on an employee who was heading out the door.

Source: TheWHIR

Beating Amazon.com in the Cloud? Europe's Betting on Paris's P19

(Bloomberg) — What began with a micro-loan from a billionaire, a moving van, and the name P19 is now a company that intends to push into the U.S. and rival Amazon.com Inc.’s $12-billion-plus cloud business.

The name stands for Paris and its 19th district, where 42-year-old Octave Klaba set up his first data center after borrowing money from fellow entrepreneur Xavier Niel, one of France’s richest people, and moving equipment back and forth from his hometown of Roubaix in the north of France.

Started in 1999, OVH Groupe SAS now has a valuation of over $1 billion and is expanding to the U.S., with KKR & Co. and TowerBrook Capital Partners as backers.

“It’s too late for new players who’d want to enter the market at this point, but for us everything remains possible,” Klaba said in an interview in Paris. “Now is the crucial moment for us. We have three years to become a giant, or fail.”

The company builds servers that it assembles into huge cloud computing data centers, leasing out storage and processing power to customers such as tire-maker Michelin, insurer AG2R La Mondiale and British rail ticket retailer Trainline.

But unlike Amazon Web Services, which both hosts and competes with Netflix Inc.’s video content, OVH is a pure-player, avoiding the uncomfortable position of being a threat to customers, Klaba said.

In Europe it has been able to grow sales 30 percent per year and generate profit, flourishing as everything from vehicles to factories becomes connected via cloud platforms, and companies from carmakers to industrials store exponentially growing amounts of data about their customers. OVH has also built data centers to cater to businesses that need to keep their data in the country or region where they do business.

For the next decade, the demand for space in such centers is expected to drive cloud offerings. The market will grow at 27.5 percent on average each year through 2025, to reach about $1.25 trillion, a report by Research and Markets showed. While U.S. players dominate and Asian giants like Alibaba Group Holding Ltd. fight to catch up, Europe is nowhere to be seen.

Microsoft Corp. reports having spent over $15 billion since 1989 on its data center infrastructure. OVH is a dwarf in comparison, with a plan to spend 1.5 billion euros ($1.7 billion) on infrastructure by 2020. It has built 27 data centers in countries from Poland to Canada, including two in the U.S., in Virginia and Oregon.

‘Not Doomed’

“Success for the smaller players is all about carefully defining and targeting niche markets, specific applications, specific user groups or specific geographic regions,” said John Dinsdale, chief analyst at Synergy Research Group. “OVH is most certainly not doomed. It just needs to figure out how it can build and maintain a position for itself that is not reliant on replicating Amazon AWS, Microsoft Azure and Google Cloud Platform.”

OVH is targeting 1 billion euros ($1.16 billion) in sales by 2020, more than double the 420 million euros it recorded in the year ended in August. It plans to hire 1,000 additional staff globally in the coming 12 months, in fields from coding to server manufacturing as well as support functions, adding to its current 2,100 employees.

The closely held company raised 250 million euros from financial investors KKR and TowerBrook in 2016, followed by 400 million euros in debt this year, and doesn’t need more money at this point, Klaba said. As it starts drafting its next strategic plan, which may include expanding into places like China, Russia and Brazil, the company will weigh whether it makes sense to go public.

“We’re the only European cloud provider with the potential to scale, in an industry where critical mass is essential,” Klaba said. “We have the capacity to invest, to grow, to innovate. Our main challenge is recruitment.”

Source: TheWHIR

VMware To Acquire VeloCloud™ Networks

VMware, Inc. has announced that it has signed a definitive agreement to acquire VeloCloud™ Networks, Inc., provider of industry-leading cloud-delivered software-defined wide-area network (SD-WAN) technology for enterprises and service providers. Once the acquisition closes, VeloCloud will enable VMware to build on the success of its industry-leading network virtualization platform — VMware NSX® — and expand its networking portfolio to address end-to-end automation, application continuity, branch transformation, and security from data center to cloud to edge. This acquisition will also further enable VMware to lead the industry transition to a software-defined future, and help customers bring their businesses into the digital era with networking that is ubiquitous, open, programmable and secure by default.

The transaction is expected to close in VMware’s fiscal Q4 2018. There is no change to VMware’s previously provided fiscal 2018 guidance due to this transaction.

According to Gartner, “While WAN architectures and technologies tend to evolve at a very slow pace — perhaps a new generation every 10 to 15 years — the disruptions caused by the transformation to digital business models are driving adoption of SD-WAN at a pace that is unheard of in wide-area networking.”1

VeloCloud cloud-delivered SD-WAN technology is deployed globally at scale by more than 1,000 customers, both directly by enterprises and by telcos and managed services providers serving enterprise customers. Service provider customers include AT&T, Deutsche Telekom, Macquarie Telecom, MetTel, Mitel, Sprint, TelePacific, Telstra, Vonage and Windstream. Enterprise customers include Bay Club, Brooks Brothers, Devcon, NCR, Redmond, Saber Healthcare Group, and Triton Management Services.

“In the digital era, a new networking approach is required to solve the hyper distribution of applications and data, as we move from a model of data centers to one of centers of data at the edge,” said Pat Gelsinger, chief executive officer, VMware. “At the heart of VMware’s networking strategy is the belief in delivering pervasive connectivity with embedded security that connects users to applications wherever they may be. With the addition of VeloCloud’s industry-leading SD-WAN technology, we will be able to extend the VMware NSX approach of automated, secure, and infrastructure-independent networking to the WAN.”

“Enterprises are transforming how they architect and utilize their infrastructure, with a shift towards a cloud-delivered, software-defined model. This enables organizations to have a globally consistent infrastructure regardless of where it is deployed — from the data center and the cloud to the edge,” said Sanjay Uppal, CEO of VeloCloud Networks. “We look forward to helping VMware, the leader in software-defined infrastructure, in the next evolution of the company’s networking and NFV strategies.”

Leading with Cloud-Delivered SD-WAN
VeloCloud’s cloud-delivered SD-WAN combines the economics and flexibility of the hybrid wide-area network (WAN) with the deployment speed and low maintenance of cloud-based services. It dramatically simplifies the WAN by delivering virtualized services from the cloud to branch offices and mobile users everywhere. VeloCloud leverages intelligent x86 edge appliances to aggregate multiple broadband links at the branch office, and, using cloud-based orchestration, connects the branch office to any type of data center: enterprise, cloud, or software-as-a-service.

With VeloCloud, VMware will enable enterprises to support application growth, network agility, and simplified branch implementations while delivering high-performance, secure, reliable branch access to cloud services, private data centers and SaaS-based applications. SD-WAN technology is ideal for businesses looking to make the transition from static, complex, on-premises networking to the cost-effective, dynamic, and scalable cloud-delivered architecture of the digital era. The VeloCloud solution provides flexibility in network connectivity options that can augment MPLS and improves overall total cost of ownership for branch connectivity.

VeloCloud will enable VMware to help service providers increase revenue and service innovation by delivering elastic transport, performance for cloud applications and a software-defined intelligent edge that can orchestrate multiple services to meet customer needs. With SD-WAN becoming the primary function in virtual customer-premise equipment deployments, VMware expects to be able to simplify the deployment of virtual network functions (VNF) for applications such as security by combining the proven VMware vCloud® NFV platform with a cloud-delivered SD-WAN platform.

“Dell EMC and VMware are committed to digitally transforming branches, the wide-area network and the cloud edge,” said Tom Burns, senior vice president, Networking, Enterprise Infrastructure and Service Provider Solutions, Dell EMC. “We recently announced a partnership with VeloCloud that includes joint product validation, coordination with product roadmaps, simplified ordering, and coordinated sales and marketing to improve solutions for our mutual customers. We look forward to continuing this SD-WAN partnership with VMware upon closing to offer mutual customers best-in-class intelligent edge appliances.”

Guiding Customers to the Software-Defined Future
VMware’s software-based approach is delivering the networking and security platform that enables customers to connect, secure and operate an end-to-end architecture to deliver services to the application wherever it may land. Customers choose VMware NSX because it delivers network and security services closest to the application. With VeloCloud, VMware will bring the same properties to the WAN, resulting in visibility, security, automation with performance, and availability for enterprise and cloud applications.

“Digital transformation has brought about a growing dependency on the network as mobility, cloud, and social business erase many of the barriers pertaining to time and place in the enterprise. Advances in IoT are also driving dependency on the network,” said Matt Eastwood, senior vice president of IDC’s enterprise, data center, cloud infrastructure, and developer research. “The network is becoming more agile, enabled by a new generation of software-based platforms from companies such as VMware and VeloCloud. We see a positive synergy between the two companies, and the opportunity for VMware to build upon the software-based networking strategy the company has been executing on.”


1. Forecast: SD-WAN and Its Impact on Traditional Router and MPLS Services Revenue, Worldwide, 2016-2020, Gartner, November 7, 2016, Document: G00317430, https://www.gartner.com/doc/3505022?ref=ddisp

Source: CloudStrategyMag

Microsoft, Oracle, IBM Are Said to Alter Pay to Push Cloud Sales

(Bloomberg) — Microsoft Corp., Oracle Corp. and IBM — looking to stoke demand for cloud computing services — are said to be shifting incentives for their sales representatives, pushing them to ensure customers become active users over the long haul.

Microsoft in July revamped the way it pays its sales staff to tie incentives to how much customers actually use cloud-based software — rather than how many sign a contract for cloud services, according to sales chief Judson Althoff. Oracle has been rolling out new rewards for at least some employees that also are connected to customers’ use of its cloud services, according to people familiar with the matter.

International Business Machines Corp. in the past year has restructured its cloud sales team and tied compensation more closely to usage, according to other people with knowledge of the matter. Traditionally, companies would ink large software deals based on factors such as the number of a customer’s devices — and not actual subsequent use of the products.

The cloud business is a crucial growth area for the traditional enterprise technology pioneers, battling against rivals Amazon.com Inc. and Alphabet Inc.’s Google. The public cloud services global market is likely to increase more than 18 percent to $260.2 billion this year and almost double to $411 billion in 2020, according to Gartner Inc. Microsoft, for example, said last week it had generated $20.4 billion in commercial cloud revenue on an annualized basis. Tying usage to sales incentives should help keep customers on board when it’s time to agree to a new contract, said Stephen White, an analyst with Gartner.

“The behaviors of the salespeople need to be more in tune with what a customer actually is going to need and use,” White said. “It certainly makes the renewal discussion easier.”

Oracle and IBM declined to comment.

Previously, Microsoft had been bundling cloud services, such as Azure for storing and running data and cloud applications, with many of its multiyear deals. Althoff said the shift in pay incentives is a significant change.

“We did have ill-informed behaviors,” he said. “We tried to sell Azure the same way we tried to sell everything else at Microsoft, which is adding it into our enterprise agreement. People were like ‘Do you want fries with that? Do you want Azure with that?’ That didn’t drive any meaningful work.”

The incentive plan change fits with Chief Executive Officer Satya Nadella’s aim to encourage Microsoft’s products to be used and loved rather than merely paid for and tolerated.

IBM has been emphasizing selling cloud infrastructure services and software and tools geared toward specific business processes and industries such as health care and finance. Oracle has been turning its focus to the cloud as well and investing in staff. The company said in August it was adding more than 5,000 people, including in sales, for its cloud business – following other related hires earlier in the year in the U.S.

While Amazon remains the largest provider of cloud computing infrastructure, the traditional companies are showing signs of improvement. In its last quarter, Microsoft’s Azure service grew 90 percent while Office 365 increased 42 percent. Oracle reported that its overall cloud sales expanded by more than 50 percent during its last period to $1.5 billion, and IBM’s sales in the market jumped about 20 percent in its third quarter to $4.1 billion.

Source: TheWHIR

ManageEngine Updates Its Cloud-Based Service Desk Software

ManageEngine has announced that it is bringing a unified approach to enterprise service management with an update to the cloud version of its flagship ITSM product, ServiceDesk Plus. With the ability to launch and manage multiple service desk instances on the go, organizations can now leverage proven IT service management (ITSM) best practices to streamline business functions for non-IT departments, including HR, facilities and finance. Available immediately, the ServiceDesk Plus cloud version comes loaded with built-in templates unique to various business processes, giving users the flexibility to perform codeless customizations for quick and easy deployment of business services.

Within any organization, employees consume services provided by various departments on a daily basis. While each department offers unique services, the processes and workflows associated with those services follow a pattern similar to that of IT service management. However, organizations often implement ITSM workflows only within their IT department, seldom leveraging these ITSM best practices to manage service delivery across other departments.

“Traditionally, the best practices of service management have only been available to the IT functions of an organization. Other departments, despite the mandate of serving end users, make do with processes and tools unique to their domain while not tapping into established standards followed by IT,” said Rajesh Ganesan, director of product management at ManageEngine. “ServiceDesk Plus takes the collective lessons from IT and brings an integrated approach to service management that cuts across different departments to deliver a consistent user experience and provide centralized visibility of all services.”

“Having separate service desk instances for IT, facilities and records allows us to track the issues separately while giving us access to the other departments’ resources. With the new version of ServiceDesk Plus, we feel like the firm’s support and administration departments are working together to provide assistance,” said Beverley Seche, network administrator at Stark & Stark, Attorneys at Law. “I love that it’s customizable, easy to use and available at a great price.”

Reimagining ITSM for Business Operations

Service operations by business teams closely align with fundamental service management processes. Unifying service operations across an organization helps provide a consistent experience for end users. Whether an employee requests information from HR or submits a work order to facilities, non-IT service requests often follow a similar workflow to that of any IT service request. So, instead of disparate applications and disjointed processes, organizations can use a centralized service desk to facilitate request logging and tracking, task automation and delegation as well as request fulfillment and feedback. With a unified service desk, each department can have its own service desk instance with templates and workflows inspired by existing IT service management processes.

Becoming a Rapid-Start Enterprise Service Desk

To date, ServiceDesk Plus has focused on providing ITSM best practices to the IT end of business. By discovering the common thread between the different service management activities within an enterprise, ServiceDesk Plus is now able to carry its industry-leading capabilities beyond IT. As an enterprise service desk, ServiceDesk Plus helps organizations instantly deploy ITSM solutions for their supporting business units by providing:

  • Rapid deployment: Create, deploy, and roll out a service desk instance in less than 60 seconds.
  • Single enterprise directory: Maintain users, service desks, authentications and associations in one place.
  • Unique service desk instances: Create separate service desk instances for each business function and facilitate organized service delivery using code-free customizations.
  • Service automation: Implement ITSM workflows to efficiently manage all aspects of the business service life cycle.
  • Built-in catalog and templates: Accelerate service management adoption across departments by using prebuilt templates and service catalogs unique to each business unit.
  • Centralized request portal: Showcase all the services that end users require using a single portal based on each individual’s access permissions.

Source: CloudStrategyMag

How the World of Connected Things is Changing Cloud Services Delivery

I recently led a session at the latest Software-Defined Enterprise Conference and Expo (SDxE) where we discussed how connected “things” are going to reshape our business, the industry, and our lives. When I asked the people in that full room how many had more than two devices that could connect into the cloud, pretty much every hand went up.

We’re living in a truly interconnected world, one that continues to generate more data, finds more ways to give analog systems a digital heartbeat, and shapes lives using new technologies.

A recent Cisco Visual Networking Index report indicated that smartphone traffic will exceed PC traffic by 2021. In 2016, PCs accounted for 46 percent of total IP traffic, but by 2021 PCs will account for only 25 percent of traffic. Smartphones will account for 33 percent of total IP traffic in 2021, up from 13 percent in 2016. PC-originated traffic will grow at a CAGR of 10 percent, while TVs, tablets, smartphones, and machine-to-machine (M2M) modules will have traffic growth rates of 21 percent, 29 percent, 49 percent, and 49 percent, respectively.

Cloud services are accelerated in part by the unprecedented amounts of data being generated by not only people, but also machines and things. And not just generated, but stored as well. The latest Cisco GCI report estimates that 600 ZB will be generated by all people, machines, and things by 2020, up from 145 ZB generated in 2015. And by 2020, data center storage installed capacity will grow to 1.8 ZB, up from 382 EB in 2015, nearly fivefold growth.

When it comes to IoT, there’s no slowdown in adoption. Cisco’s report indicates that within the enterprise segment, database/analytics and IoT will be the fastest growing applications, with 22 percent CAGR from 2015 to 2020, or 2.7-fold growth. Growth in machine-to-machine connections and applications is also driving new data analytics needs. When it comes to connected things, we have to remember that IoT applications have very different characteristics. In some cases, application analytics and management can occur at the edge device level whereas for others it is more appropriately handled centrally, typically hosted in the cloud.

Cloud will evolve to support the influx of connected things

It’s not just how many new things are connected into the cloud. We must also remember the data that these devices or services are creating. Cloud services are already adapting to support a much more distributed organization, with better capabilities to support users, their devices, and their most critical services. Consider the following.

  • The edge is growing and helping IoT initiatives. A recent MarketsAndMarkets report indicated that CDN vendors help organizations deliver content to their end users efficiently, with better QoE and QoS. CDNs also let organizations store content near its intended users and secure it against attacks like DDoS. The report indicated the sheer size of the potential market: the segment is expected to grow from $4.95 billion in 2015 to $15.73 billion in 2020. With more data being created, organizations are working hard to find ways to deliver this information to their consumers and users. This shift in data consumption has changed the way we use data center technologies and deliver critical data.
  • Cloud is elastic – but we have to understand our use cases and adopt them properly. There are so many powerful use cases and technologies we can leverage within the cloud already. WANOP, SD-WAN, CDNs, hybrid cloud, and other solutions are allowing us to connect faster and leverage our devices efficiently. When working with end users, make sure you know which type of services you require. For example, in some cases you might need a hyperscale data center platform rather than a public cloud. In some cases, you still need granular control over the proximity of data to the remote application. This is where you need to decide between public cloud options and those of hyperscale providers. There’s no right or wrong here – just the right use case for the right service.
  • Security will continue to remain a top concern. Recent research from Juniper suggests that the rapid digitization of consumers’ lives and enterprise records will increase the cost of data breaches to $2.1 trillion globally by 2019, almost four times the estimated cost of breaches in 2015. That’s trillion with a ‘t’. The report goes on to say that the average cost of a data breach will exceed $150 million by 2020, as more business infrastructure gets connected. Remember, the data that we create isn’t benign. In fact, it’s very valuable to us, to businesses, and to the bad guys. Ensuring device and data security best practices will help you protect your brand and keep user confidence high. Depending on the device and your industry, make sure to really plan out your device and data security ecosystem. And it’ll always be important to ensure your security plan is agile and can adapt to a constantly evolving digital market.

Powerful cloud services are becoming a major part of our connected society. We’ve come to rely on file sharing, application access, connected physical devices, and much more to help us through our daily lives and in the business world. The goal of cloud services will be to enable these types of connections in a transparent manner.

Cloud services will continue to evolve to support data and device requirements. The goal for the organization (and the user) will be to ensure the right types of services are being used. With that, keep an eye on the edge – it’ll continue to shape the way we leverage cloud, connected devices, and the data we create.

Source: TheWHIR

Dremio: Simpler and faster data analytics

Now is a great time to be a developer. Over the past decade, decisions about technology have moved from the boardroom to innovative developers, who are building with open source and making decisions based on the merits of the underlying project rather than the commercial relationships provided by a vendor. New projects have emerged that focus on making developers more productive, and that are easier to manage and scale. This is true for virtually every layer of the technology stack. The result is that developers today have almost limitless opportunities to explore new technologies, new architectures, and new deployment models.

Looking at the data layer in particular, NoSQL systems such as MongoDB, Elasticsearch, and Cassandra have pushed the envelope in terms of agility, scalability, and performance for operational applications, each with a different data model and approach to schema. Along the way many development teams moved to a microservices model, spreading application data across many different underlying systems.

In terms of analytics, old and new data sources have found their way into a mix of traditional data warehouses and data lakes, some on Hadoop, others on Amazon S3. And the rise of the Kafka data streaming platform creates an entirely different way of thinking about data movement and analysis of data in motion.

With data in so many different technologies and underlying formats, analytics on modern data is hard. BI and analytics tools such as Tableau, Power BI, R, Python, and machine learning models were designed for a world in which data lives in a single, high-performance relational database. In addition, users of these tools – business analysts, data scientists, and machine learning models – want the ability to access, explore, and analyze data on their own, without any dependency on IT.

Introducing the Dremio data fabric

BI tools, data science systems, and machine learning models work best when data lives in a single, high-performance relational database. Unfortunately, that’s not where data lives today. As a result, IT has no choice but to bridge that gap through a combination of custom ETL development and proprietary products. In many companies, the analytics stack includes the following layers:

  • Data staging. The data is moved from various operational databases into a single staging area such as a Hadoop cluster or cloud storage service (e.g., Amazon S3).
  • Data warehouse. While it is possible to execute SQL queries directly on Hadoop and cloud storage, these systems are simply not designed to deliver interactive performance. Therefore, a subset of the data is usually loaded into a relational data warehouse or MPP database.
  • Cubes, aggregation tables, and BI extracts. In order to provide interactive performance on large datasets, the data must be pre-aggregated and/or indexed by building cubes in an OLAP system or materialized aggregation tables in the data warehouse.

This multi-layer architecture introduces many challenges. It is complex, fragile, and slow, and creates an environment where data consumers are entirely dependent on IT.

Dremio introduces a new tier in data analytics we call a self-service data fabric. Dremio is an open source project that enables business analysts and data scientists to explore and analyze any data at any time, regardless of its location, size, or structure. Dremio combines a scale-out architecture with columnar execution and acceleration to achieve interactive performance on any data volume, while enabling IT, data scientists, and business analysts to seamlessly shape the data according to the needs of the business.

Built on Apache Arrow, Apache Parquet, and Apache Calcite

Dremio utilizes high-performance columnar storage and execution, powered by Apache Arrow (columnar in memory) and Apache Parquet (columnar on disk). Dremio also uses Apache Calcite for SQL parsing and query optimization, building on the same libraries as many other SQL-based engines, such as Apache Hive.

Apache Arrow is an open source project that enables columnar in-memory data processing and interchange. Arrow was created by Dremio, and includes committers from various companies including Cloudera, Databricks, Hortonworks, Intel, MapR, and Two Sigma.

Dremio is the first execution engine built from the ground up on Apache Arrow. Internally, the data in memory is maintained off-heap in the Arrow format, and there will soon be an API that returns query results as Arrow memory buffers.

A variety of other projects have embraced Arrow as well. Python (Pandas) and R are among these projects, enabling data scientists to work more efficiently with data. For example, Wes McKinney, creator of the popular Pandas library, recently demonstrated how Arrow enables Python users to read data into Pandas at over 10 GB/s.
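
As a concrete sketch of that workflow (assuming pyarrow is installed and a local Parquet file named reviews.parquet exists; both are illustrative):

import pyarrow.parquet as pq

# Read the Parquet file into an Arrow Table, held in columnar form.
table = pq.read_table("reviews.parquet")

# Hand the columns to Pandas; for many column types this avoids
# row-by-row conversion entirely, which is where the speed comes from.
df = table.to_pandas()
print(df.head())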

How Dremio enables self-service data

In addition to the ability to work interactively with their datasets, data engineers, business analysts, and data scientists also need a way to curate the data so that it is suitable for the needs of a specific project. This is a fundamental shift from the IT-centric model, where consumers of data initiate a request for a dataset and wait for IT to fulfill their request weeks or months later. Dremio enables a self-service model, where consumers of data use Dremio’s data curation capabilities to collaboratively discover, curate, accelerate, and share data without relying on IT.

All of these capabilities are accessible through a modern, intuitive, web-based UI:

  • Discover. Dremio includes a unified data catalog where users can discover and explore physical and virtual datasets. The data catalog is automatically updated when new data sources are added, and as data sources and virtual datasets evolve. All metadata is indexed in a high-performance, searchable index, and exposed to users throughout the Dremio interface.
  • Curate. Dremio enables users to curate data by creating virtual datasets. A variety of point-and-click transformations are supported, and advanced users can utilize SQL syntax to define more complex transformations. As queries execute in the system, Dremio learns about the data, enabling it to recommend various transformations such as joins and data type conversions.
  • Accelerate. Dremio is capable of accelerating datasets by up to 1000x over the performance of the source system. Users can vote for datasets they think should be faster, and Dremio’s heuristics will consider these votes in determining which datasets to accelerate. Optionally, system administrators can manually determine which datasets to accelerate.
  • Share. Dremio enables users to securely share data with other users and groups. In this model a group of users can collaborate on a virtual dataset that will be used for a particular analytical job. Alternately, users can upload their own data, such as Excel spreadsheets, to join to other datasets from the enterprise catalog. Creators of virtual datasets can determine which users can query or edit their virtual datasets. It’s like Google Docs for your data.

How Dremio data acceleration works

Dremio utilizes highly optimized physical representations of source data called Data Reflections. The Reflection Store can live on HDFS, MapR-FS, cloud storage such as S3, or direct-attached storage (DAS). The Reflection Store size can exceed that of physical memory. This architecture enables Dremio to accelerate more data at a lower cost, resulting in a much higher cache hit ratio compared to traditional memory-only architectures. Data Reflections are automatically utilized by the cost-based optimizer at query time.

Data Reflections are invisible to end users. Unlike OLAP cubes, aggregation tables, and BI extracts, the user does not explicitly connect to a Data Reflection. Instead, users issue queries against the logical model, and Dremio’s optimizer automatically accelerates the query by taking advantage of the Data Reflections that are suitable for the query based on the optimizer’s cost analysis.

When the optimizer cannot accelerate the query, Dremio utilizes its high-performance distributed execution engine, leveraging columnar in-memory processing (via Apache Arrow) and advanced push-downs into the underlying data sources (when dealing with RDBMS or NoSQL sources).

How Dremio handles SQL queries

Client applications issue SQL queries to Dremio over ODBC, JDBC, or REST. A query might involve one or more datasets, potentially residing in different data sources. For example, a query may be a join between a Hive table, Elasticsearch, and several Oracle tables.
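
As a sketch of what such a federated query looks like from a client over ODBC (the DSN name and dataset paths below are illustrative assumptions, not Dremio defaults):

import pyodbc

# Connect to Dremio through a DSN configured for its ODBC driver.
conn = pyodbc.connect("DSN=Dremio", autocommit=True)
cursor = conn.cursor()

# One SQL statement spanning two source systems; Dremio plans the
# push-downs and performs the join in its execution engine.
cursor.execute("""
    SELECT c.name, o.total
    FROM oracle.crm.customers c
    JOIN hive.sales.orders o ON o.customer_id = c.id
    WHERE o.total > 1000
""")
for row in cursor.fetchall():
    print(row.name, row.total)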

Dremio utilizes two primary techniques to reduce the amount of processing required for a query:

  • Push-downs into the underlying data source. The optimizer will consider the capabilities of the underlying data source and the relative costs. It will then generate a plan that performs stages of the query either in the source or in Dremio’s distributed execution environment to achieve the most efficient overall plan possible.
  • Acceleration via Data Reflections. The optimizer will use Data Reflections for portions of the query when this produces the most efficient overall plan. In many cases the entire query can be serviced from Data Reflections as they can be orders of magnitude more efficient than processing queries in the underlying data source.

Query push-downs

Dremio is able to push down processing into relational and non-relational data sources. Non-relational data sources typically do not support SQL and have limited execution capabilities. A file system, for example, cannot apply predicates or aggregations. MongoDB, on the other hand, can apply predicates and aggregations, but does not support all joins. The Dremio optimizer understands the capabilities of each data source. When it is most efficient, Dremio will push as much of a query to the underlying source as possible and perform the rest in its own distributed execution engine.

Offloading operational databases

Most operational databases are designed for write-optimized workloads. Furthermore, these deployments must address stringent SLAs, as any downtime or degraded performance can significantly impact the business. As a result, operational systems are frequently isolated from processing analytical queries. In these cases Dremio can execute analytical queries using Data Reflections, which provide the most efficient query processing possible while minimizing the impact on the operational system. Data Reflections are updated periodically based on policies that can be configured on a table-by-table basis.

Query execution phases

The life of a query includes the following phases:

  1. Client submits query to coordinator via ODBC/JDBC/REST
  2. Planning
    1. Coordinator parses query into Dremio’s universal relational model
    2. Coordinator considers available statistics on data sources to develop query plan, as well as functional abilities of the source
  3. Coordinator rewrites query plan to use
    1. the available Data Reflections, considering ordering, partitioning, and distribution of the Data Reflections, and
    2. the available capabilities of the data source
  4. Execution
    1. Executors read data into Arrow buffers from sources in parallel
    2. Executors execute the rewritten query plan
    3. One executor merges the results from one or more executors and streams the final results to the coordinator
  5. Client receives the results from the coordinator

Note that the data may come from Data Reflections or the underlying data source(s). When reading from a data source, the executor submits the native queries (e.g. MongoDB MQL, Elasticsearch Query DSL, Microsoft Transact-SQL) as determined by the optimizer in the planning phase.

All data operations are performed on the executor node, enabling the system to scale to many concurrent clients using only a few coordinator nodes.

Example query push-down

To illustrate how Data Fabric fits into your data architecture, let’s take a closer look at running a SQL query on a source that doesn’t support SQL.

One of the more popular modern data sources is Elasticsearch. There is a lot to like about Elasticsearch, but in terms of analytics it doesn’t support SQL (including SQL joins). That means tools like Tableau and Excel can’t be used to analyze data from applications built on this data store. There is a visualization project called Kibana that is popular for Elasticsearch, but Kibana is designed for developers. It’s not really for business users.

Dremio makes it easy to analyze data in Elasticsearch with any SQL-based tool, including Tableau. Let’s take for example the following SQL query for Yelp business data, which is stored in JSON:

SELECT state, city, name, review_count
FROM elastic.yelp.business
WHERE
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
ORDER BY review_count DESC, state, city
LIMIT 10

Dremio compiles the query into an expression that Elasticsearch can process:

{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  }
}

There’s really no limit to the SQL that can be executed on Elasticsearch or any supported data source with Dremio. Here is a slightly more complex example that involves a windowing expression:

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) AS city_avg_stars,
    AVG(bus_avg_stars) OVER () AS all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(review.stars) AS bus_avg_stars,
      COUNT(review.review_id) AS bus_review_count
    FROM
      elastic.yelp.business AS business
      LEFT OUTER JOIN elastic.yelp.review AS review
        ON business.business_id = review.business_id
    GROUP BY
      city, name
  )
)
WHERE bus_review_count > 100
ORDER BY bus_avg_stars DESC, bus_review_count DESC

This query asks how top-rated businesses compare to other businesses in each city. It looks at the average review for each business with more than 100 reviews compared to the average for all businesses in the same city. To perform this query, data from two different datasets in Elasticsearch must be joined together, an action that Elasticsearch doesn’t support. Parts of the query are compiled into expressions Elasticsearch can process, and the rest of the query is evaluated in Dremio’s distributed SQL execution engine.

If we were to create a Data Reflection on one of these datasets, Dremio’s query planner would automatically rewrite the query to use the Data Reflection instead of performing this push-down operation. The user wouldn’t need to change their query or connect to a different physical resource. They would simply experience reduced latency, sometimes by as much as 1000x, depending on the source and complexity of the query.

An open source, industry standard data platform

Analysis and data science are about iterative investigation and exploration of data. Regardless of the complexity and scale of today’s datasets, analysts need to make fast decisions and iterate, without waiting for IT to provide or prepare the data.

To deliver true self-sufficiency, a self-service data fabric should be expected to deliver data faster than the underlying infrastructure. It must understand how to cache various representations of the data in analytically optimized formats and pick the right representations based on freshness expectations and performance requirements. And it must do all of this in a smart way, without relying on explicit knowledge management and sharing.

Data Reflections are a sophisticated way to cache representations of data across many sources, applying multiple techniques to optimize performance and resource consumption. Through Data Reflections, Dremio allows any user’s interaction with any dataset (virtual or physical) to be autonomously routed through sophisticated algorithms.

As the number and variety of data sources in your organization continue to grow, investing in and relying on a new tier in your data stack will become necessary. You will need a solution with an open source core built on industry-standard technologies. Dremio provides a powerful execution and persistence layer built upon Apache Arrow, Apache Calcite, and Apache Parquet, three key pillars for the next generation of data platforms.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (Adobe), and aQuantive (Microsoft).

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data

Apache PredictionIO: Easier machine learning with Spark

The Apache Foundation has added a new machine learning project to its roster, Apache PredictionIO, an open-sourced version of a project originally devised by a subsidiary of Salesforce.

What PredictionIO does for machine learning and Spark

Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using customizable templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.
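
As a sketch of that loop using the Python SDK (the access key and event fields are placeholders; ports 7070 and 8000 are the documented defaults for the event server and a deployed engine):

import predictionio

# Send a training event to the event server.
events = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070")
events.create_event(
    event="rate",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i42",
    properties={"rating": 4.0})

# After training and deploying the engine (pio train, pio deploy),
# query it for predictions based on the model.
engine = predictionio.EngineClient(url="http://localhost:8000")
print(engine.send_query({"user": "u1", "num": 5}))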

Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and Apache offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC, Elasticsearch, HBase, HDFS, and local file systems are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.

How PredictionIO templates make it easier to serve predictions from Spark

PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They describe any third-party dependencies that might be needed for the job, such as the Apache Mahout machine-learning app framework.

Some existing templates include:

Some templates also integrate with other machine learning products. For example, two of the prediction templates currently in PredictionIO’s gallery, for churn rate detection and general recommendations, use H2O.ai’s Sparkling Water enhancements for Spark.

PredictionIO can also automatically evaluate a prediction engine to determine the best hyperparameters to use with it. The developer needs to choose and set metrics for the evaluation, but there’s generally less work involved than in tuning hyperparameters by hand.

When running as a service, PredictionIO can accept predictions singly or as a batch. Batched predictions are automatically parallelized across a Spark cluster, as long as the algorithms used in a batch prediction job are all serializable. (PredictionIO’s default algorithms are.)

Where to download PredictionIO

PredictionIO’s source code is available on GitHub. For convenience, various Docker images are available, as well as a Heroku build pack.

Source: InfoWorld Big Data