Dremio: Simpler and faster data analytics

Now is a great time to be a developer. Over the past decade, decisions about technology have moved from the boardroom to innovative developers, who are building with open source and making decisions based on the merits of the underlying project rather than the commercial relationships provided by a vendor. New projects have emerged that focus on making developers more productive, and that are easier to manage and scale. This is true for virtually every layer of the technology stack. The result is that developers today have almost limitless opportunities to explore new technologies, new architectures, and new deployment models.

Looking at the data layer in particular, NoSQL systems such as MongoDB, Elasticsearch, and Cassandra have pushed the envelope in terms of agility, scalability, and performance for operational applications, each with a different data model and approach to schema. Along the way many development teams moved to a microservices model, spreading application data across many different underlying systems.

In terms of analytics, old and new data sources have found their way into a mix of traditional data warehouses and data lakes, some on Hadoop, others on Amazon S3. And the rise of the Kafka data streaming platform creates an entirely different way of thinking about data movement and analysis of data in motion.

With data in so many different technologies and underlying formats, analytics on modern data is hard. BI and analytics tools such as Tableau, Power BI, R, Python, and machine learning models were designed for a world in which data lives in a single, high-performance relational database. In addition, users of these tools – business analysts, data scientists, and machine learning models – want the ability to access, explore, and analyze data on their own, without any dependency on IT.

Introducing the Dremio data fabric

BI tools, data science systems, and machine learning models work best when data lives in a single, high-performance relational database. Unfortunately, that’s not where data lives today. As a result, IT has no choice but to bridge that gap through a combination of custom ETL development and proprietary products. In many companies, the analytics stack includes the following layers:

  • Data staging. The data is moved from various operational databases into a single staging area such as a Hadoop cluster or cloud storage service (e.g., Amazon S3).
  • Data warehouse. While it is possible to execute SQL queries directly on Hadoop and cloud storage, these systems are simply not designed to deliver interactive performance. Therefore, a subset of the data is usually loaded into a relational data warehouse or MPP database.
  • Cubes, aggregation tables, and BI extracts. In order to provide interactive performance on large datasets, the data must be pre-aggregated and/or indexed by building cubes in an OLAP system or materialized aggregation tables in the data warehouse.

This multi-layer architecture introduces many challenges. It is complex, fragile, and slow, and creates an environment where data consumers are entirely dependent on IT.

Dremio introduces a new tier in data analytics we call a self-service data fabric. Dremio is an open source project that enables business analysts and data scientists to explore and analyze any data at any time, regardless of its location, size, or structure. Dremio combines a scale-out architecture with columnar execution and acceleration to achieve interactive performance on any data volume, while enabling IT, data scientists, and business analysts to seamlessly shape the data according to the needs of the business.

Built on Apache Arrow, Apache Parquet, and Apache Calcite

Dremio utilizes high-performance columnar storage and execution, powered by Apache Arrow (columnar in memory) and Apache Parquet (columnar on disk). Dremio also uses Apache Calcite for SQL parsing and query optimization, building on the same libraries as many other SQL-based engines, such as Apache Hive.

Apache Arrow is an open source project that enables columnar in-memory data processing and interchange. Arrow was created by Dremio, and includes committers from various companies including Cloudera, Databricks, Hortonworks, Intel, MapR, and Two Sigma.

Dremio is the first execution engine built from the ground up on Apache Arrow. Internally, the data in memory is maintained off-heap in the Arrow format, and there will soon be an API that returns query results as Arrow memory buffers.

A variety of other projects have embraced Arrow as well. Python (Pandas) and R are among these projects, enabling data scientists to work more efficiently with data. For example, Wes McKinney, creator of the popular Pandas library, recently demonstrated how Arrow enables Python users to read data into Pandas at over 10 GB/s.
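
To make that Arrow-to-Pandas path concrete, here is a minimal sketch in Python using pyarrow, the Python implementation of Apache Arrow. The Parquet file name is a placeholder; any local Parquet file would do.

# Minimal sketch: Parquet (columnar on disk) -> Arrow (columnar in memory) -> Pandas.
# The file name "yelp_business.parquet" is hypothetical.
import pyarrow.parquet as pq

# Read the Parquet file into an Arrow Table held in columnar memory buffers.
table = pq.read_table("yelp_business.parquet")

# Hand the Arrow buffers to Pandas without a row-by-row conversion loop.
df = table.to_pandas()

print(df.head())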

How Dremio enables self-service data

In addition to the ability to work interactively with their datasets, data engineers, business analysts, and data scientists also need a way to curate the data so that it is suitable for the needs of a specific project. This is a fundamental shift from the IT-centric model, where consumers of data initiate a request for a dataset and wait for IT to fulfill their request weeks or months later. Dremio enables a self-service model, where consumers of data use Dremio’s data curation capabilities to collaboratively discover, curate, accelerate, and share data without relying on IT.

All of these capabilities are accessible through a modern, intuitive, web-based UI:

  • Discover. Dremio includes a unified data catalog where users can discover and explore physical and virtual datasets. The data catalog is automatically updated when new data sources are added, and as data sources and virtual datasets evolve. All metadata is indexed in a high-performance, searchable index, and exposed to users throughout the Dremio interface.
  • Curate. Dremio enables users to curate data by creating virtual datasets. A variety of point-and-click transformations are supported, and advanced users can utilize SQL syntax to define more complex transformations. As queries execute in the system, Dremio learns about the data, enabling it to recommend various transformations such as joins and data type conversions.
  • Accelerate. Dremio is capable of accelerating datasets by up to 1000x over the performance of the source system. Users can vote for datasets they think should be faster, and Dremio’s heuristics will consider these votes in determining which datasets to accelerate. Optionally, system administrators can manually determine which datasets to accelerate.
  • Share. Dremio enables users to securely share data with other users and groups. In this model a group of users can collaborate on a virtual dataset that will be used for a particular analytical job. Alternately, users can upload their own data, such as Excel spreadsheets, to join to other datasets from the enterprise catalog. Creators of virtual datasets can determine which users can query or edit their virtual datasets. It’s like Google Docs for your data.

How Dremio data acceleration works

Dremio utilizes highly optimized physical representations of source data called Data Reflections. The Reflection Store can live on HDFS, MapR-FS, cloud storage such as S3, or direct-attached storage (DAS). The Reflection Store size can exceed that of physical memory. This architecture enables Dremio to accelerate more data at a lower cost, resulting in a much higher cache hit ratio compared to traditional memory-only architectures. Data Reflections are automatically utilized by the cost-based optimizer at query time.

Data Reflections are invisible to end users. Unlike OLAP cubes, aggregation tables, and BI extracts, the user does not explicitly connect to a Data Reflection. Instead, users issue queries against the logical model, and Dremio’s optimizer automatically accelerates the query by taking advantage of the Data Reflections that are suitable for the query based on the optimizer’s cost analysis.

When the optimizer cannot accelerate the query, Dremio utilizes its high-performance distributed execution engine, leveraging columnar in-memory processing (via Apache Arrow) and advanced push-downs into the underlying data sources (when dealing with RDBMS or NoSQL sources).

How Dremio handles SQL queries

Client applications issue SQL queries to Dremio over ODBC, JDBC, or REST. A query might involve one or more datasets, potentially residing in different data sources. For example, a query might join a Hive table, an Elasticsearch index, and several Oracle tables.
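
As a rough sketch of what this looks like from the client side, the following Python snippet issues a single federated query over ODBC using pyodbc. The DSN name, credentials, and source and table names are illustrative assumptions, not values from Dremio’s documentation.

# Hedged sketch: one SQL statement spanning three source systems, sent to Dremio
# over ODBC. Assumes an ODBC DSN named "Dremio" is configured for Dremio's driver;
# the hive/oracle/elastic source and table names below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=demo_user;PWD=demo_password", autocommit=True)
cursor = conn.cursor()

# Dremio's optimizer decides which pieces of this plan to push down to each
# source and which pieces to run in its own distributed execution engine.
cursor.execute("""
    SELECT c.customer_id, c.region, SUM(o.amount) AS total_spend
    FROM hive.sales.orders AS o
    JOIN oracle.crm.customers AS c ON o.customer_id = c.customer_id
    JOIN elastic.web.activity AS w ON w.customer_id = c.customer_id
    GROUP BY c.customer_id, c.region
""")

for row in cursor.fetchall():
    print(row)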

Dremio utilizes two primary techniques to reduce the amount of processing required for a query:

  • Push-downs into the underlying data source. The optimizer will consider the capabilities of the underlying data source and the relative costs. It will then generate a plan that performs stages of the query either in the source or in Dremio’s distributed execution environment to achieve the most efficient overall plan possible.
  • Acceleration via Data Reflections. The optimizer will use Data Reflections for portions of the query when this produces the most efficient overall plan. In many cases the entire query can be serviced from Data Reflections as they can be orders of magnitude more efficient than processing queries in the underlying data source.

Query push-downs

Dremio is able to push down processing into relational and non-relational data sources. Non-relational data sources typically do not support SQL and have limited execution capabilities. A file system, for example, cannot apply predicates or aggregations. MongoDB, on the other hand, can apply predicates and aggregations, but does not support all joins. The Dremio optimizer understands the capabilities of each data source. When it is most efficient, Dremio pushes as much of a query to the underlying source as possible and performs the rest in its own distributed execution engine.

Offloading operational databases

Most operational databases are designed for write-optimized workloads. Furthermore, these deployments must address stringent SLAs, as any downtime or degraded performance can significantly impact the business. As a result, operational systems are frequently isolated from processing analytical queries. In these cases Dremio can execute analytical queries using Data Reflections, which provide the most efficient query processing possible while minimizing the impact on the operational system. Data Reflections are updated periodically based on policies that can be configured on a table by table basis.

Query execution phases

The life of a query includes the following phases:

  1. Client submits query to coordinator via ODBC/JDBC/REST
  2. Planning
    1. Coordinator parses query into Dremio’s universal relational model
    2. Coordinator considers available statistics on data sources to develop query plan, as well as functional abilities of the source
  3. Coordinator rewrites query plan to use
    1. the available Data Reflections, considering ordering, partitioning, and distribution of the Data Reflections and
    2. the available capabilities of the data source
  4. Execution
    1. Executors read data into Arrow buffers from sources in parallel
    2. Executors execute the rewritten query plan
    3. One executor merges the results from one or more executors and streams the final results to the coordinator
  5. Client receives the results from the coordinator

Note that the data may come from Data Reflections or the underlying data source(s). When reading from a data source, the executor submits the native queries (e.g. MongoDB MQL, Elasticsearch Query DSL, Microsoft Transact-SQL) as determined by the optimizer in the planning phase.

All data operations are performed on the executor node, enabling the system to scale to many concurrent clients using only a few coordinator nodes.

Example query push-down

To illustrate how Data Fabric fits into your data architecture, let’s take a closer look at running a SQL query on a source that doesn’t support SQL.

One of the more popular modern data sources is Elasticsearch. There is a lot to like about Elasticsearch, but in terms of analytics it doesn’t support SQL (including SQL joins). That means tools like Tableau and Excel can’t be used to analyze data from applications built on this data store. There is a visualization project called Kibana that is popular for Elasticsearch, but Kibana is designed for developers. It’s not really for business users.

Dremio makes it easy to analyze data in Elasticsearch with any SQL-based tool, including Tableau. Let’s take for example the following SQL query for Yelp business data, which is stored in JSON:

SELECT state, city, name, review_count
FROM elastic.yelp.business
WHERE
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
ORDER BY review_count DESC, state, city
LIMIT 10

Dremio compiles the query into an expression that Elasticsearch can process:

{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  }
}

There’s really no limit to the SQL that can be executed on Elasticsearch or any supported data source with Dremio. Here is a slightly more complex example that involves a windowing expression:

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) AS city_avg_stars,
    AVG(bus_avg_stars) OVER () AS all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(review.stars) AS bus_avg_stars,
      COUNT(review.review_id) AS bus_review_count
    FROM
      elastic.yelp.business AS business
      LEFT OUTER JOIN elastic.yelp.review AS review ON business.business_id = review.business_id
    GROUP BY
      city, name
  )
)
WHERE bus_review_count > 100
ORDER BY bus_avg_stars DESC, bus_review_count DESC

This query asks how top-rated businesses compare to other businesses in each city. It looks at the average review for each business with more than 100 reviews compared to the average for all businesses in the same city. To perform this query, data from two different datasets in Elasticsearch must be joined together, an action that Elasticsearch doesn’t support. Parts of the query are compiled into expressions Elasticsearch can process, and the rest of the query is evaluated in Dremio’s distributed SQL execution engine.

If we were to create a Data Reflection on one of these datasets, Dremio’s query planner would automatically rewrite the query to use the Data Reflection instead of performing this push-down operation. The user wouldn’t need to change their query or connect to a different physical resource. They would simply see lower latency, in some cases by as much as 1000x, depending on the source and the complexity of the query.

An open source, industry standard data platform

Analysis and data science are about iterative investigation and exploration of data. Regardless of the complexity and scale of today’s datasets, analysts need to make fast decisions and iterate, without waiting for IT to provide or prepare the data.

To deliver true self-sufficiency, a self-service data fabric should be expected to deliver data faster than the underlying infrastructure. It must understand how to cache various representations of the data in analytically optimized formats and pick the right representations based on freshness expectations and performance requirements. And it must do all of this in a smart way, without relying on explicit knowledge management and sharing.

Data Reflections are a sophisticated way to cache representations of data across many sources, applying multiple techniques to optimize performance and resource consumption. Through Data Reflections, Dremio allows any user’s interaction with any dataset (virtual or physical) to be autonomously routed through sophisticated algorithms.

As the number and variety of data sources in your organization continue to grow, investing in and relying on a new tier in your data stack will become necessary. You will need a solution with an open source core that is built on industry-standard technologies. Dremio provides a powerful execution and persistence layer built upon Apache Arrow, Apache Calcite, and Apache Parquet, three key pillars for the next generation of data platforms.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (Adobe), and aQuantive (Microsoft).

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.

Source: InfoWorld Big Data

Apache PredictionIO: Easier machine learning with Spark

The Apache Foundation has added a new machine learning project to its roster, Apache PredictionIO, an open-sourced version of a project originally devised by a subsidiary of Salesforce.

What PredictionIO does for machine learning and Spark

Apache PredictionIO is built atop Spark and Hadoop, and serves Spark-powered predictions from data using customizable templates for common tasks. Apps send data to PredictionIO’s event server to train a model, then query the engine for predictions based on the model.

Spark, MLlib, HBase, Spray, and Elasticsearch all come bundled with PredictionIO, and Apache offers supported SDKs for working in Java, PHP, Python, and Ruby. Data can be stored in a variety of back ends: JDBC, Elasticsearch, HBase, HDFS, and local file systems are all supported out of the box. Back ends are pluggable, so a developer can create a custom back-end connector.
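
As a rough sketch of the send-events-then-query workflow described above, the snippet below uses PredictionIO’s Python SDK. The access key, server URLs, and the event and query fields are placeholders; the exact fields depend on which engine template you deploy.

# Hedged sketch of PredictionIO's event/query loop. The access key, ports, and
# field names are placeholders; adapt them to your engine template.
import predictionio

# Send a training event to the event server (default port 7070).
event_client = predictionio.EventClient(
    access_key="YOUR_ACCESS_KEY",
    url="http://localhost:7070")
event_client.create_event(
    event="rate",
    entity_type="user",
    entity_id="u1",
    target_entity_type="item",
    target_entity_id="i42",
    properties={"rating": 4.0})

# After the engine is trained and deployed (default port 8000), ask for predictions.
engine_client = predictionio.EngineClient(url="http://localhost:8000")
result = engine_client.send_query({"user": "u1", "num": 5})
print(result)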

How PredictionIO templates make it easier to serve predictions from Spark

PredictionIO’s most notable advantage is its template system for creating machine learning engines. Templates reduce the heavy lifting needed to set up the system to serve specific kinds of predictions. They describe any third-party dependencies that might be needed for the job, such as the Apache Mahout machine-learning app framework.

A number of templates already exist, covering common prediction tasks such as recommendations and classification.

Some templates also integrate with other machine learning products. For example, two of the prediction templates currently in PredictionIO’s gallery, for churn rate detection and general recommendations, use H2O.ai’s Sparkling Water enhancements for Spark.

PredictionIO can also automatically evaluate a prediction engine to determine the best hyperparameters to use with it. The developer needs to choose and set the metrics used for the evaluation, but this generally involves less work than tuning hyperparameters by hand.

When running as a service, PredictionIO can accept predictions singly or as a batch. Batched predictions are automatically parallelized across a Spark cluster, as long as the algorithms used in a batch prediction job are all serializable. (PredictionIO’s default algorithms are.)

Where to download PredictionIO

PredictionIO’s source code is available on GitHub. For convenience, various Docker images are available, as well as a Heroku build pack.

Source: InfoWorld Big Data

R tutorial: Learn to crunch big data with R

A few years ago, I was the CTO and cofounder of a startup in the medical practice management software space. One of the problems we were trying to solve was how to schedule medical office visits to optimize everyone’s time. Too often, office visits are scheduled to optimize the physician’s time, and patients have to wait way too long in overcrowded waiting rooms in the company of people coughing contagious diseases out of their lungs.

One of my cofounders, a hospital medical director, had a multivariate linear model that could predict the required length for an office visit based on the reason for the visit, whether the patient needs a translator, the average historical visit lengths of both doctor and patient, and other possibly relevant factors. One of the subsystems I needed to build was a monthly regression task to update all of the coefficients in the model based on historical data. After exploring many options, I chose to implement this piece in R, taking advantage of the wide variety of statistical (linear and nonlinear modeling, classical statistical tests, time-series analysis, classification, clustering) and graphical techniques implemented in the R system.

One of the attractions for me was the R scripting language, which makes it easy to save and rerun analyses on updated data sets; another attraction was the ability to integrate R and C++. A key benefit for this project was the fact that R, unlike Microsoft Excel and other GUI analysis programs, is completely auditable.

Alas, that startup ran out of money not long after I implemented a proof-of-concept web application, at least partially because our first hospital customer had to declare Chapter 7 bankruptcy. Nevertheless, I continue to favor R for statistical analysis and data science.

Source: InfoWorld Big Data

IDG Contributor Network: Your analytics strategy is obsolete

In the information age, it’s the data-driven bird that gets the worm. Giant companies like Google, Facebook, and Apple hoard data, because it’s the information equivalent of gold.

But merely hoarding data isn’t enough. You need to be adept at sifting through, tying together, and making sense of all the data spilling out of your data lakes. Only then can you act on data to make better decisions and build smarter products.

Yet in the crowded and overfunded analytics market, seeing through the stupefying vendor smog can be all but impossible. To help you make sense of the vast and confusing analytics space, I’ve put together a list of my top predictions for the next five years.

With any luck, these predictions will help you steer your organization toward data-driven bliss.

1. BI migrates into apps

For the past 20 years, we’ve been witnessing a revolution. Not the kind that happens overnight, but the kind that happens gradually. So slowly, in fact, you may not have noticed.

BI is dying. Or more precisely, BI is transmogrifying.

Tableau, a company founded back in 2003, was the last “BI” company to sprout a unicorn horn. And let’s be honest, Tableau is not really a bread-and-butter BI solution—it’s a data visualization tool that acquired enough BI sparkle to take on the paleolithic Goliaths that formerly dominated the industry.

Every year, users are gorging themselves on more and more analytics through the apps they use, like HubSpot, SalesForce, and MailChimp. Analytics is migrating into the very fabric of the business applications.

In essence, business applications are acquiring their own analytics interfaces, custom-tailored to their data and their use cases. This integration and customization makes the analytic interfaces more accessible to users than esoteric, complex general-purpose BI (though at the cost of increasing data silos and making the big picture harder to see).

This trend will continue as B2B apps everywhere start competing on data intelligence offerings (those chintzy one-page analytics dashboards are a relic of the past).

2. Compilers over engines

Historically, fresh and tasty analytics were served up two ways: by precomputation (when common aggregations are precomputed and stored in-memory, like in OLAP engines), or by analytics engines (including analytic databases like Teradata and Vertica).

Analytics engines, like Spark and the data engine in Tableau, are responsible for performing the computation required to answer key questions over an organization’s data.

Now there’s a new player on the scene: the analytics compiler. Analytic compilers can flexibly deploy computations to different infrastructures. Examples of analytic compilers include the red-hot TensorFlow, which can deploy computations to GPUs or CPUs, as well as Drill and Quasar Analytics.

Compilers are vastly more flexible than engines because they can take number-crunching recipes and translate them to run in different infrastructures (in-database, on Spark, in a GPU, whatever!). Compilers can also, in theory, generate workflows that run way faster than any interpreted engine.
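
As a toy illustration of that idea, the snippet below defines one computation and targets it at different hardware through TensorFlow’s device placement. It assumes the TensorFlow 1.x-style graph and session API that was current when this was written.

# Toy sketch: the same number-crunching recipe placed on different devices.
# Uses the TensorFlow 1.x graph/session API.
import tensorflow as tf

def build_matmul(device):
    # Identical computation, different hardware target.
    with tf.device(device):
        a = tf.random_normal([1000, 1000])
        b = tf.random_normal([1000, 1000])
        return tf.matmul(a, b)

cpu_result = build_matmul("/cpu:0")
# gpu_result = build_matmul("/gpu:0")  # same recipe, compiled for a GPU

with tf.Session() as sess:
    print(sess.run(cpu_result)[0][:5])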

Even Spark has been acquiring basic compilation facilities, which is a sure sign that compilers are here to stay, and may eventually eclipse legacy pure computational engines.

3. ETL diversifies

Few data acronyms can strike more fear into the hearts of executives than the dreaded “ETL.” Extract-transform-load is the necessary evil by which piles of incomplete, duplicated, unrelated, messy slop are pulled out, cleaned up, and shoved somewhere the data Vulcans can mind-meld with them.

ETL is the antithesis of modern, agile, and data-driven. ETL means endlessly replicated data, countless delays, and towering expenses. It means not being able to answer the questions that matter when they matter.

In an attempt to make ETL more agile, the industry has developed a variety of alternatives, most heavily funded at the moment by venture capital. These solutions range from high-level ETL tools that make it easier to do ETL into Hadoop or a data warehouse, to streaming ETL solutions, to ETL solutions that leverage machine learning to cross-reference and deduplicate.

Another very interesting class of technology includes tools like Dremio and Xcalar, which reimagine ETL as extract-load-transform (or ELT). In essence, they push transformation to the end and make it lazy, so you don’t have to do any upfront extraction, loading, or transformation.

Historically, ELT has been slow, but these next-generation solutions make ELT fast by dynamically reshaping, indexing, and caching common transformations. This gives you the performance of traditional ETL, with the flexibility of late-stage transformations.

No matter how you slice it, ETL is undergoing dramatic evolution that will make it easier than ever for organizations to rapidly leverage data without time-consuming and costly upfront investments in IT.

4. Data silos open up

The big problems at big organizations don’t really involve fancy analytics. Most companies can’t even add up and count their data. Not because sums and counts are hard, but because data in a modern organization is fragmented and scattered in ten thousand silos.

Thanks to the cloud (including the API revolution and managed data solutions) and recent advances in ETL, it’s becoming easier than ever for organizations to access more of their data in a structured way.

Next-generation data management solutions will play an important role in leveraging these technological advances to make all of an organization’s data analytically accessible to all the right people in a timely fashion.

5. Machine learning gets practical

Machine learning is just past the peak of the hype cycle. Or at least we can hope so. Unnamed tech celebrities who don’t understand how machine learning works continue to rant about doomsday Terminator scenarios, even while consumers can’t stop joking about how terrible Siri is.

Machine learning suffers from a fatal combination of imperfection and inculpability. When machine learning goes wrong (as it often and inevitably does), there’s no one to blame, and no one to learn from the mistake.

That’s an absolute no-no for any kind of mission-critical analytics.

So until we are able to train artificial minds on the entirety of knowledge absorbed by society’s brightest, the magical oracle that can answer any question over the data of a business is very far off. Much farther than five years.

Until then, we are likely to see very focused applications of machine learning. For example, ThoughtSpot’s natural language interface to BI; black-box predictive analytics for structured data sets; and human-assistive technology that lets people see connections between different data sources, correct common errors, and spot anomalies.

These aren’t the superbrains promised in science fiction, but they will make it easier for users to figure out what questions to ask and help guide them toward finding correct answers.

While analytics is a giant market and filled with confusing marketing speak, there are big trends shaping the industry that will dictate where organizations invest.

These trends include the ongoing migration of data intelligence into business applications, the advent of analytic compilers that can deploy workflows to ad hoc infrastructure, the rapidly evolving state of ETL, the increased accessibility of data silos to organizations, and the very pragmatic if unsensational ways that machine learning is improving analytics tools.

These overarching trends for the next five years will ripple into the tools that organizations adopt, the analytic startups that get funded, the acquisitions that existing players make, and the innovation that we see throughout the entire analytic stack, from data warehouse to visual analytics front-ends.

When figuring out what your data architecture and technology stack should look like, choose wisely, because the industry is in the process of reinvention, and few stones will be left unturned.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

IDG Contributor Network: ETL is dead

Extract, transform, and load. It doesn’t sound too complicated. But, as anyone who’s managed a data pipeline will tell you, the simple name hides a ton of complexity.

And while none of the steps are easy, the part that gives data engineers nightmares is the transform. Taking raw data, cleaning it, filtering it, reshaping it, summarizing it, and rolling it up so that it’s ready for analysis. That’s where most of your time and energy goes, and it’s where there’s the most room for mistakes.

If ETL is so hard, why do we do it this way?

The answer, in short, is because there was no other option. Data warehouses couldn’t handle the raw data as it was extracted from source systems, in all its complexity and size. So the transform step was necessary before you could load and eventually query data. The cost, however, was steep.

Rather than maintaining raw data that could be transformed into any possible end product, the transform shaped your data into an intermediate form that was less flexible. You lost some of the data’s resolution, imposed the current version of your business’ metrics on the data, and threw out useless data.

And if any of that changed—if you needed hourly data when previously you’d only processed daily data, if your metric definitions changed, or some of that “useless” data turned out to not be so useless after all—then you’d have to fix your transformation logic, reprocess your data, and reload it.

The fix might take days or weeks

It wasn’t a great system, but it’s what we had.

So as technologies change and prior constraints fall away, it’s worth asking what we would do in an ideal world—one where data warehouses were infinitely fast and could handle data of any shape or size. In that world, there’d be no reason to transform data before loading it. You’d extract it and load it in its rawest form.

You’d still want to transform the data, because querying low-quality, dirty data isn’t likely to yield much business value. But your infinitely fast data warehouse could handle that transformation right at query time. The transformation and query would all be a single step. Think of it as just-in-time transformation. Or ELT.

The advantage of this imaginary system is clear: You wouldn’t have to decide ahead of time which data to discard or which version of your metric definitions to use. You’d always use the freshest version of your transformation logic, giving you total flexibility and agility.

So, is that the world we live in? And if so, should we switch to ELT?

Not quite. Data warehouses have indeed gotten several orders of magnitude faster and cheaper. Transformations that used to take hours and cost thousands of dollars now take seconds and cost pennies. But they can still get bogged down with misshapen data or huge processes.

So there’s still some transformation that’s best accomplished outside the warehouse. Removing irrelevant or dirty data, and doing heavyweight reshaping, is still often a preloading process. But this initial transform is a much smaller step and thus much less likely to need updating down the road.

Basically, it’s gone from a big, all-encompassing ‘T’ to a much smaller ‘t’

Once the initial transform is done, it’d be nice to move the rest of the transform to query time. But especially with larger data volumes, the data warehouses still aren’t quite fast enough to make that workable. (Plus, you still need a good way to manage the business logic and impose it as people query.)

So instead of moving all of that transformation to query time, more and more companies are doing most of it in the data warehouse—but they’re doing it immediately after loading. This gives them lots more agility than in the old system, but maintains tolerable performance. For now, at least, this is where the biggest “T” is happening.

The lightest-weight transformations—the ones the warehouses can do very quickly—are happening right at query time. This represents another small “t,” but it has a very different focus than the preloading “t.” That’s because these lightweight transformations often involve prototypes of new metrics and more ad hoc exploration, so the total flexibility that query-time transformation provides is ideal.
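
To make the pattern concrete, here is an illustrative sketch that uses SQLite as a stand-in for a cloud data warehouse: raw data is loaded untouched, the big “T” runs inside the database right after loading, and a small query-time “t” is expressed as a view.

# Illustrative ELT sketch with SQLite standing in for the warehouse.
import sqlite3

conn = sqlite3.connect(":memory:")

# Extract + load: land the raw records with no upfront transformation.
conn.execute("CREATE TABLE raw_orders (order_id INTEGER, amount_cents INTEGER, country TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [(1, 1250, "us"), (2, 930, "US"), (3, 40, None)])

# The big "T", run in the database immediately after loading:
# clean, filter, and standardize into an analysis-ready table.
conn.execute("""
    CREATE TABLE orders AS
    SELECT order_id, amount_cents / 100.0 AS amount_usd, UPPER(country) AS country
    FROM raw_orders
    WHERE country IS NOT NULL
""")

# The small query-time "t": an ad hoc metric defined as a view, so its logic
# can change without reprocessing any data.
conn.execute("""
    CREATE VIEW revenue_by_country AS
    SELECT country, SUM(amount_usd) AS revenue
    FROM orders
    GROUP BY country
""")

print(conn.execute("SELECT * FROM revenue_by_country").fetchall())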

In short, we’re seeing a huge shift that takes advantage of new technologies to make analytics more flexible, more responsive, and more performant. As a result, employees are making better decisions using data that was previously slow, inaccessible, or worst of all, wrong. And the companies that embrace this shift are outpacing rivals stuck in the old way of doing things.

ETL? ETL is dead. But long live … um … EtLTt?

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data

Cisco Updates Its SDN Solution

Cisco has announced updates to its Application Centric Infrastructure (Cisco ACI™), a software-defined networking (SDN) solution designed to make it easier for customers to adopt and advance intent-based networking for their data centers. With the latest software release (ACI 3.0), more than 4,000 ACI customers can increase business agility with network automation, simplified management, and improved security for any combination of workloads in containers, virtual machines, and bare metal for private clouds and on-premises data centers.

The transitions occurring in the data center are substantial. Enterprises experience an unrelenting need to accelerate speed, flexibility, security and scale across increasingly complex data centers and multi-cloud environments.

“As our customers shift to multi-cloud strategies, they are seeking ways to simplify the management and scalability of their environments,” said Ish Limkakeng, senior vice president for data center networking at Cisco. “By automating basic IT operations with a central policy across multiple data centers and geographies, ACI’s new multi-site management capability helps network operators more easily move and manage workloads with a single pane of glass — a significant step in delivering on Cisco’s vision for enabling ACI Anywhere.”

The new ACI 3.0 software release is now available. New features include:

Multi-site Management: Customers can seamlessly connect and manage multiple ACI fabrics that are geographically distributed to improve availability by isolating fault domains, and provide a global view of network policy through a single management portal. This greatly simplifies disaster recovery and the ability to scale out applications.

Kubernetes Integration: Customers can deploy their workloads as micro-services in containers, define ACI network policy for these through Kubernetes, and get unified networking constructs for containers, virtual machines, and bare-metal. This brings the same level of deep integration to containers ACI has had with numerous hypervisors.

Improved Operational Flexibility and Visibility: The new Next Gen ACI User Interface improves usability with consistent layouts, simplified topology views, and troubleshooting wizards. In addition, ACI now includes graceful insertion and removal, support for mixed operating systems and quota management, and latency measurements across fabric end points for troubleshooting.

Security: ACI 3.0 delivers new capabilities to protect networks by mitigating attacks such as IP/MAC spoofing with First Hop Security integration, automatically authenticating workloads in-band and placing them in trusted security groups, and support for granular policy enforcement for end points within the same security group.

“With ‘ACI Anywhere,’ Cisco is delivering a scalable solution that will help position customers for success in multi-cloud and multi-site environments,” said Dan Conde, an analyst with Enterprise Strategy Group. “ACI’s new integration with container cluster managers and its enhancements to zero trust security make this a modern offering for the market, whether you are a large Service Provider, Enterprise, or a commercial customer.”

Source: CloudStrategyMag

UKCloud Launches Cloud GPU Services

UKCloud has announced the launch of its Cloud GPU computing service based on NVIDIA virtual GPU solutions with NVIDIA Tesla P100 and M60 GPUs (graphics processing units). The service will support computational and visualisation intensive workloads for UKCloud’s UK public sector and health care customers. UKCloud is not only the first Cloud Service Provider based in the UK or Europe to offer Cloud GPU computing services with NVIDIA GPUs, but is also the only provider specialising in public sector and health care and the specific needs of these customers.

“Building on the foundation of UKCloud’s secure, assured, UK-Sovereign platform, we are now able to offer a range of cloud-based compute, storage and GPU services to meet our customers’ complex workload requirements,” said Simon Hansford, CEO, UKCloud. “The public sector is driving more complex computational and visualisation intensive workloads than ever before, not only for CAD development packages, but also for tasks like the simulation of infrastructure changes in transport, for genetic sequencing in health or for battlefield simulation in defence. In response to this demand, we have a greater focus on emerging technologies such as deep learning, machine learning and artificial intelligence.”

Many of today’s modern applications, especially in fields such as medical imaging or graphical analytics, need an NVIDIA GPU to power them, whether they are running on a laptop or desktop, on a departmental server or on the cloud. Just as organisations are finding that their critical business applications can be run more securely and efficiently in the cloud, so too they are realising that it makes sense to host graphical and visualisation intensive workloads there as well.

Adding cloud GPU computing services utilising NVIDIA technology to support more complex computational and visualisation intensive workloads was a customer requirement captured via UKCloud Ideas, a service that was introduced as part of UKCloud’s maniacal focus on customer service excellence. UKCloud Ideas proactively polls its clients for ideas and wishes for service improvements, enabling customers to vote on ideas and drive product improvements across the service. This has facilitated more than 40 feature improvements in the last year across UKCloud’s service catalogue from changes to the customer portal to product specific improvements.

One comment came from a UKCloud partner with many clients needing GPU capability: “One of our applications includes 3D functionality which requires a graphics card. We have several customers who might be interested in a hosted solution but would require access to this functionality. To this end it would be helpful if UKCloud were able to offer us a solution which included a GPU.”

Listening to its clients in this way and acting on their suggestions to improve its service by implementing NVIDIA GPU technology was one of a number of initiatives that enabled UKCloud to win a 2017 UK Customer Experience Award for putting customers at the heart of everything, through the use of technology.

“The availability of NVIDIA GPUs in the cloud means businesses can capitalise on virtualisation without compromising the functionality and responsiveness of their critical applications,” added Bart Schneider, Senior Director of CSP EMEA at NVIDIA. “Even customers running graphically complex or compute-intensive applications can benefit from rapid turn-up, service elasticity and cloud-economics.”

UKCloud’s GPU-accelerated cloud service, branded as Cloud GPU, is available in two versions: Compute and Visualisation. Both are based on NVIDIA GPUs and initially available only on UKCloud’s Enterprise Compute Cloud platform. They will be made available on UKCloud’s other platforms at a later date. The two versions are as follows:

  • UKCloud’s Cloud GPU Compute: This is a GPU accelerated computing service, based on the NVIDIA Tesla P100 GPU and supports applications developed using NVIDIA CUDA, that enables parallel co-processing on both the CPU and GPU. Typical use cases include looking for cures, trends and research findings in medicine along with genomic sequencing, data mining and analytics in social engineering, and trend identification and predictive analytics in business or financial modelling and other applications of AI and deep learning. Available from today with all VM sizes, Cloud GPU Compute will represent an additional cost of £1.90 per GPU per hour on top of the cost of the VM.
  • UKCloud’s Cloud GPU Visualisation: This is a virtual GPU (vGPU) service, utilising the NVIDIA Tesla M60, that extends the power of NVIDIA GPU technology to virtual desktops and apps. In addition to powering remote workspaces, typical use cases include military training simulations and satellite image analysis in defence, medical imaging and complex image rendering. Available from the end of October with all VM sizes, Cloud GPU Visualisation will represent an additional cost of £0.38 per vGPU per hour on top of the cost of the VM.

UKCloud has also received a top accolade from NVIDIA, that of ‘2017 Best Newcomer’ in the EMEA partner awards that were announced at NVIDIA’s October GPU Technology Conference 2017 in Munich. UKCloud was commended for making GPU technology more accessible for the UK public sector. As the first European Cloud Service Provider with NVIDIA GPU Accelerated Computing, UKCloud is helping to accelerate the adoption of Artificial Intelligence across all areas of the public sector, from central and local government to defence and healthcare, by allowing its customers and partners to harness the awesome power of GPU compute, without having to build specific rigs.

Source: CloudStrategyMag

Alibaba Cloud Joins Red Hat Certified Cloud And Service Provider Program

Red Hat, Inc. and Alibaba Cloud have announced that they will join forces to bring the power and flexibility of Red Hat’s open source solutions to Alibaba Cloud’s customers around the globe.

Alibaba Cloud is now part of the Red Hat Certified Cloud and Service Provider program, joining a group of technology industry leaders who offer Red Hat-tested and validated solutions that extend the functionality of Red Hat’s broad portfolio of open source cloud solutions. The partnership extends the reach of Red Hat’s offerings across the top public clouds globally, providing a scalable destination for cloud computing and reiterating Red Hat’s commitment to providing greater choice in the cloud.

“Our customers not only want greater performance, flexibility, security and portability for their cloud initiatives; they also want the freedom of choice for their heterogeneous infrastructures. They want to be able to deploy their technologies of choice on their scalable infrastructure of choice. That is Red Hat’s vision and the focus of the Red Hat Certified Cloud and Service Provider Program. By working with Alibaba Cloud, we’re helping to bring more choice and flexibility to customers as they deploy Red Hat’s open source solutions across their cloud environments,” said Mike Ferris, vice president, technical business development and business architecture, Red Hat.

In the coming months, Red Hat solutions will be available directly to Alibaba Cloud customers, enabling them to take advantage of the full value of Red Hat’s broad portfolio of open source cloud solutions. Alibaba Cloud intends to offer Red Hat Enterprise Linux in a pay-as-you-go model in the Alibaba Cloud Marketplace.

By joining the Red Hat Certified Cloud and Service Provider program, Alibaba Cloud has signified that it is a destination for Red Hat customers, independent software vendors (ISVs) and partners to enable them to benefit from Red Hat offerings in public clouds. These will be provided under innovative consumption and service models with the greater confidence that Red Hat product experts have validated the solutions.

“As enterprises in China, and throughout the world, look to modernize application environments, a full-lifecycle solution by Red Hat on Alibaba Cloud can provide customers higher flexibility and agility. We look forward to working with Red Hat to help enterprise customers with their journey of scaling workloads to Alibaba Cloud,” said Yeming Wang, deputy general manager of Alibaba Cloud Global, Alibaba Cloud.

Launched in 2009, the Red Hat Certified Cloud and Service Provider Program is designed to assemble the solutions cloud providers need to plan, build, manage, and offer hosted cloud solutions and Red Hat technologies to customers. The Certified Cloud Provider designation is awarded to Red Hat partners following validation by Red Hat. Each provider meets testing and certification requirements to demonstrate that they can deliver a safe, scalable, supported, and consistent environment for enterprise cloud deployments.

In addition, in the coming months, Red Hat customers will be able to move eligible, unused Red Hat subscriptions from their datacenter to Alibaba Cloud, China’s largest public cloud service provider, using Red Hat Cloud Access. Red Hat Cloud Access is an innovative “bring-your-own-subscription” benefit available from select Red Hat Certified Cloud and Service Providers that enables customers to move eligible Red Hat subscriptions from on-premise to public clouds. Red Hat Cloud Access also enables customers to maintain a direct relationship with Red Hat – including the ability to receive full support from Red Hat’s award-winning Global Support Services organization, enabling customers to maintain a consistent level of service across certified hybrid deployment infrastructures.

Source: CloudStrategyMag

Edgeconnex® Enables Cloudflare Video Streaming Service

EdgeConneX® has announced a new partnership with Cloudflare to enable and deploy its new Cloudflare Stream service. The massive Edge deployment will roll out in 18 Edge Data Centers® (EDCs) across North America and Europe, enabling Cloudflare to bring data within a few milliseconds of local market endusers and providing fast and effective delivery of bandwidth-intensive content.

Cloudflare powers more than 10% of all Internet requests and ensures that web properties, APIs and applications run efficiently and stay online. On September 27, 2017, exactly seven years after the company’s launch, Cloudflare expanded its offerings with Cloudflare Stream, a new service that combines encoding and global delivery to form a solution for the technical and business issues associated with video streaming. By deploying Stream at all of Cloudflare’s edge nodes, Cloudflare is providing customers the ability to integrate high-quality, reliable streaming video into their applications.

In addition to the launch of Stream, Cloudflare is rolling out four additional new services: Unmetered Mitigation, which eliminates surge pricing for DDoS mitigation; Geo Key Manager, which provides customers with granular control over where they place their private keys; Cloudflare Warp, which eliminates the effort required to fully mask and protect an application; and Cloudflare Workers, which writes and deploys JavaScript code at the edge. As part of its ongoing global expansion, Cloudflare is launching with EdgeConneX to serve more customers with fast and reliable web services.

“We think video streaming will be a ubiquitous component within all websites and apps in the future, and it’s our immediate goal to expand the number of companies that are streaming video from 1,000 to 100,000,” explains Matthew Prince, co-founder and CEO, Cloudflare. “Combined with EdgeConneX’s portfolio of Edge Data Centers, our technology enables a global solution across all 118 of our points of presence, for the fastest and most secure delivery of video and Internet content.”

In order to effectively deploy its services, including the newly launched Stream solution, Cloudflare is allowing customers to run basic software at global facilities located at the Edge of the network. To achieve this, Cloudflare has selected EdgeConneX to provide fast and reliable content delivery to end users. When deploying Stream and other services in EDCs across North America and Europe, Cloudflare will utilize this massive Edge deployment to further enhance its service offerings.

Cloudflare’s performance gains from EdgeConneX EDCs have been verified by Cedexis, the leader in latency-based load balancing for content and cloud providers. Their panel of Real User Measurement data showed significant response time improvements immediately following the EdgeConneX EDC deployments — 33% in the Minneapolis metro area and 20% in the Portland metro area.

“When it comes to demonstrating the effectiveness of storing data at an EdgeConneX EDC, the numbers speak for themselves,” says Clint Heiden, chief commercial officer, EdgeConneX. “We look forward to continuing our work with Cloudflare to help them deliver a wide range of cutting-edge services to their customer base, including Cloudflare Stream.”

Source: CloudStrategyMag

IDG Contributor Network: AI and quantum computing: technology that's fueling innovation and solving future problems

Two weeks ago, I spent time in Orlando, Florida, attending Microsoft’s huge IT pro and developer conference known as Microsoft Ignite. Having the opportunity to attend events such as this to see the latest in technological advancements is one of the highlights of my job. Every year, I am amazed at what new technologies are being made available to us. The pace of innovation has increased exponentially over the last five years. I can only imagine what the youth of today will bring to this world as our next generation’s creators.

Microsoft’s CEO, Satya Nadella, kicked off the vision keynote on Day 1. As always, he gets the crowd pumped up with his inspirational speeches. If you saw Satya’s keynote last year, you could almost bet on what he was going to be talking about this year. His passion, and Microsoft’s mission, is to empower every person and every organization on the planet to achieve more. This is a bold statement, but one that I believe is possible. He also shared how Microsoft is combining cloud, artificial intelligence, and mixed reality across their product portfolio to help customers innovate and build the future of business. This was followed by a demonstration of how Ford Motor was able to use these technologies to improve product design and engineering and innovate at a much faster pace today. It’s clear to me that AI is going to be a core part of our lives as we continue to evolve with this technology.

The emergence of machine learning business models based on the use of the cloud is in fact a big factor in why AI is taking off. Prior to the cloud, AI projects had high costs, but cloud economics have rendered certain machine learning capabilities relatively inexpensive and less complicated to operate. Thanks to the integration of cloud and AI, very specialized artificial intelligence startups are exploding in growth. Besides the aforementioned Microsoft, AI projects and initiatives at tech giants such as Facebook, Google, and Apple are also exploding.

As we move forward, the potential for these technologies to help people in ways that we have never been able to before is going to become more of a reality than a dream. Technologies such as AI, serverless computing, containers, augmented reality, and, yes, quantum computing will fundamentally change how we do things and fuel innovation at a pace faster than ever before.

One of the most exciting moments that had everyone’s attention at Ignite was when Microsoft shared what it has been doing around quantum computing. We’ve heard about this, but is it real? The answer is yes. Other influential companies such as IBM and Google are investing resources in this technology as well. It’s quite complex but very exciting. To see a technology like this come to fruition and make an impact in my lifetime would be nothing short of spectacular.

Moore’s Law states the number of transistors on a microprocessor will double every 18 months. Today, traditional computers store data as binary digits represented by either a 1 or 0 to signify a state of on or off. With this model, we have come a long way from the early days of computing power, but there is still a need for even faster and more powerful processing. Intel is already working with 10-nanometer manufacturing process technology, code-named Cannon Lake, that will offer reduced power consumption, higher density, and increased performance. In the very near future circuits will have to be measured on an atomic scale. This is where quantum computing comes in.

I’m not an expert in this field, but I have taken an interest in this technology as I have a background in electronics engineering. In simple terms—quantum computing harnesses the power of atoms and molecules to perform memory and processing tasks. Quantum computing is combining the best of math, physics, and computer science using what is referred to as electron fractionalization.

Quantum computers aren’t limited to only two states. They encode information using quantum bits, otherwise known as qubits. A qubit can store data as both 1 and 0 at the same time, a property known as superposition, which unlocks parallelism. That probably doesn’t tell you much, so think of it this way: this technology could enable us to solve complex problems in hours or days that would normally take billions of years with traditional computers. Think about that for a minute and you will realize just how significant this could be. This could enable researchers to develop and simulate new materials, improve medicines, accelerate AI, and solve world hunger and global warming. Quantum computing will help us solve the impossible.
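
For a back-of-the-envelope sense of why classical machines struggle to keep up, the short Python sketch below (plain NumPy arithmetic, not a quantum SDK) shows how the memory needed just to hold an n-qubit state vector grows as 2^n, which is why classical simulators top out around 30 to 40 qubits.

# Back-of-the-envelope sketch: an n-qubit state vector holds 2**n complex amplitudes.
import numpy as np

def simulator_memory_gb(n_qubits, bytes_per_amplitude=16):
    # 2**n complex128 amplitudes, converted to gigabytes.
    return (2 ** n_qubits) * bytes_per_amplitude / 1024 ** 3

for n in (30, 40, 50):
    print(f"{n} qubits -> ~{simulator_memory_gb(n):,.0f} GB of state")

# A single qubit in equal superposition: amplitudes for |0> and |1>.
qubit = np.array([1, 1], dtype=complex) / np.sqrt(2)
print("P(0) =", abs(qubit[0]) ** 2, " P(1) =", abs(qubit[1]) ** 2)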

There are some inherent challenges with quantum computing. If you try to look at a qubit, you risk bumping it, thereby causing its value to change. Scientists have devised ways to observe these quantum superpositions without destroying them. This is done by using cryogenics to cool the quantum chips down to temperatures in the range of 0.01 K (–459.65ºF), where there are no vibrations to interfere with measurements.

Soon, developers will be able to test algorithms by running them in a local simulator on their own computers, simulating around 30 qubits, or in Azure, simulating around 40 qubits. As companies such as Microsoft, Google, and IBM continue to develop technologies such as this, dreams of quantum computing are becoming a reality. This technological innovation is not about who is the first to prove the value of quantum computing. This is about solving real-world problems for our future generations in hopes of a better world.

This article is published as part of the IDG Contributor Network. Want to Join?

Source: InfoWorld Big Data