Dremio: Simpler and faster data analytics

Now is a great time to be a developer. Over the past decade, decisions about technology have moved from the boardroom to innovative developers, who are building with open source and making decisions based on the merits of the underlying project rather than the commercial relationships provided by a vendor. New projects have emerged that focus on making developers more productive, and that are easier to manage and scale. This is true for virtually every layer of the technology stack. The result is that developers today have almost limitless opportunities to explore new technologies, new architectures, and new deployment models.

Looking at the data layer in particular, NoSQL systems such as MongoDB, Elasticsearch, and Cassandra have pushed the envelope in terms of agility, scalability, and performance for operational applications, each with a different data model and approach to schema. Along the way many development teams moved to a microservices model, spreading application data across many different underlying systems.

In terms of analytics, old and new data sources have found their way into a mix of traditional data warehouses and data lakes, some on Hadoop, others on Amazon S3. And the rise of the Kafka data streaming platform creates an entirely different way of thinking about data movement and analysis of data in motion.

With data in so many different technologies and underlying formats, analytics on modern data is hard. BI and analytics tools such as Tableau, Power BI, R, Python, and machine learning models were designed for a world in which data lives in a single, high-performance relational database. In addition, users of these tools – business analysts, data scientists, and machine learning models – want the ability to access, explore, and analyze data on their own, without any dependency on IT.

Introducing the Dremio data fabric

BI tools, data science systems, and machine learning models work best when data lives in a single, high-performance relational database. Unfortunately, that’s not where data lives today. As a result, IT has no choice but to bridge that gap through a combination of custom ETL development and proprietary products. In many companies, the analytics stack includes the following layers:

  • Data staging. The data is moved from various operational databases into a single staging area such as a Hadoop cluster or cloud storage service (e.g., Amazon S3).
  • Data warehouse. While it is possible to execute SQL queries directly on Hadoop and cloud storage, these systems are simply not designed to deliver interactive performance. Therefore, a subset of the data is usually loaded into a relational data warehouse or MPP database.
  • Cubes, aggregation tables, and BI extracts. In order to provide interactive performance on large datasets, the data must be pre-aggregated and/or indexed by building cubes in an OLAP system or materialized aggregation tables in the data warehouse.

This multi-layer architecture introduces many challenges. It is complex, fragile, and slow, and creates an environment where data consumers are entirely dependent on IT.

Dremio introduces a new tier in data analytics we call a self-service data fabric. Dremio is an open source project that enables business analysts and data scientists to explore and analyze any data at any time, regardless of its location, size, or structure. Dremio combines a scale-out architecture with columnar execution and acceleration to achieve interactive performance on any data volume, while enabling IT, data scientists, and business analysts to seamlessly shape the data according to the needs of the business.

Built on Apache Arrow, Apache Parquet, and Apache Calcite

Dremio utilizes high-performance columnar storage and execution, powered by Apache Arrow (columnar in memory) and Apache Parquet (columnar on disk). Dremio also uses Apache Calcite for SQL parsing and query optimization, building on the same libraries as many other SQL-based engines, such as Apache Hive.

Apache Arrow is an open source project that enables columnar in-memory data processing and interchange. Arrow was created by Dremio, and includes committers from various companies including Cloudera, Databricks, Hortonworks, Intel, MapR, and Two Sigma.

Dremio is the first execution engine built from the ground up on Apache Arrow. Internally, the data in memory is maintained off-heap in the Arrow format, and there will soon be an API that returns query results as Arrow memory buffers.

A variety of other projects have embraced Arrow as well. Python (Pandas) and R are among these projects, enabling data scientists to work more efficiently with data. For example, Wes McKinney, creator of the popular Pandas library, recently demonstrated how Arrow enables Python users to read data into Pandas at over 10 GB/s.
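
To get a feel for this interchange, here is a minimal Python sketch using the pyarrow library: it reads a Parquet file into Arrow’s columnar in-memory format and hands the result to Pandas. The file name is a hypothetical placeholder, and the snippet is illustrative rather than a reproduction of that benchmark.

# Minimal sketch: move columnar data from Parquet into Arrow, then into Pandas.
# The file name is hypothetical; any local Parquet file would do.
import pyarrow.parquet as pq

# Read the Parquet file into an Arrow Table (columnar, held in native memory)
table = pq.read_table("yelp_business.parquet")

# Hand the table to Pandas; because both sides share a columnar layout,
# this avoids expensive row-by-row conversion for many column types
df = table.to_pandas()

print(table.num_rows, table.num_columns)
print(df.head())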

How Dremio enables self-service data

In addition to the ability to work interactively with their datasets, data engineers, business analysts, and data scientists also need a way to curate the data so that it is suitable for the needs of a specific project. This is a fundamental shift from the IT-centric model, where consumers of data initiate a request for a dataset and wait for IT to fulfill their request weeks or months later. Dremio enables a self-service model, where consumers of data use Dremio’s data curation capabilities to collaboratively discover, curate, accelerate, and share data without relying on IT.

All of these capabilities are accessible through a modern, intuitive, web-based UI:

  • Discover. Dremio includes a unified data catalog where users can discover and explore physical and virtual datasets. The data catalog is automatically updated when new data sources are added, and as data sources and virtual datasets evolve. All metadata is indexed in a high-performance, searchable index, and exposed to users throughout the Dremio interface.
  • Curate. Dremio enables users to curate data by creating virtual datasets. A variety of point-and-click transformations are supported, and advanced users can use SQL syntax to define more complex transformations (a brief sketch follows this list). As queries execute in the system, Dremio learns about the data, enabling it to recommend transformations such as joins and data type conversions.
  • Accelerate. Dremio can accelerate datasets by up to 1000x over the performance of the source system. Users can vote for datasets they think should be faster, and Dremio’s heuristics will consider these votes in determining which datasets to accelerate. Optionally, system administrators can manually determine which datasets to accelerate.
  • Share. Dremio enables users to securely share data with other users and groups. In this model a group of users can collaborate on a virtual dataset that will be used for a particular analytical job. Alternately, users can upload their own data, such as Excel spreadsheets, to join to other datasets from the enterprise catalog. Creators of virtual datasets can determine which users can query or edit their virtual datasets. It’s like Google Docs for your data.
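
As a rough sketch of the curation step mentioned above, here is the kind of SQL a curated virtual dataset might encapsulate. It is illustrative only: the dataset paths reuse the elastic.yelp.* names from the query examples later in this article, and in practice such a definition would typically be created and saved through Dremio’s web UI.

# Illustrative sketch only: query text an analyst might save as a virtual dataset,
# combining a join, a type conversion, and a filter. This is not a Dremio API,
# just the SQL that the virtual dataset would encapsulate.
virtual_dataset_sql = """
SELECT b.business_id,
       b.name,
       b.city,
       CAST(b.review_count AS INTEGER) AS review_count,
       r.stars
FROM elastic.yelp.business AS b
LEFT OUTER JOIN elastic.yelp.review AS r
  ON b.business_id = r.business_id
WHERE b.review_count > 0
"""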

How Dremio data acceleration works

Dremio utilizes highly optimized physical representations of source data called Data Reflections. The Reflection Store can live on HDFS, MapR-FS, cloud storage such as S3, or direct-attached storage (DAS). The Reflection Store size can exceed that of physical memory. This architecture enables Dremio to accelerate more data at a lower cost, resulting in a much higher cache hit ratio compared to traditional memory-only architectures. Data Reflections are automatically utilized by the cost-based optimizer at query time.

Data Reflections are invisible to end users. Unlike OLAP cubes, aggregation tables, and BI extracts, the user does not explicitly connect to a Data Reflection. Instead, users issue queries against the logical model, and Dremio’s optimizer automatically accelerates the query by taking advantage of the Data Reflections that are suitable for the query based on the optimizer’s cost analysis.

When the optimizer cannot accelerate the query, Dremio utilizes its high-performance distributed execution engine, leveraging columnar in-memory processing (via Apache Arrow) and advanced push-downs into the underlying data sources (when dealing with RDBMS or NoSQL sources).

How Dremio handles SQL queries

Client applications issue SQL queries to Dremio over ODBC, JDBC, or REST. A query might involve one or more datasets, potentially residing in different data sources. For example, a query may be a join between a Hive table, Elasticsearch, and several Oracle tables.
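
As a minimal sketch of that flow, the Python snippet below submits such a federated query over ODBC. It assumes the Dremio ODBC driver is installed and registered under a DSN named "Dremio"; the credentials, source names, and table names are hypothetical placeholders.

# Minimal sketch: submit a federated SQL query to Dremio over ODBC.
# Assumes a DSN named "Dremio" backed by the Dremio ODBC driver; the credentials,
# source names, and table names below are hypothetical.
import pyodbc

conn = pyodbc.connect("DSN=Dremio;UID=analyst;PWD=secret", autocommit=True)
cursor = conn.cursor()

# A single SQL statement spanning a Hive table, an Elasticsearch index, and an Oracle table
cursor.execute("""
    SELECT o.customer_id, o.order_total, c.segment, w.page_views
    FROM oracle.sales.orders AS o
    JOIN hive.marketing.customers AS c ON o.customer_id = c.customer_id
    JOIN elastic.weblogs.sessions AS w ON o.customer_id = w.customer_id
    WHERE o.order_total > 100
""")

for row in cursor.fetchmany(10):
    print(row)

conn.close()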

Dremio utilizes two primary techniques to reduce the amount of processing required for a query:

  • Push-downs into the underlying data source. The optimizer will consider the capabilities of the underlying data source and the relative costs. It will then generate a plan that performs stages of the query either in the source or in Dremio’s distributed execution environment to achieve the most efficient overall plan possible.
  • Acceleration via Data Reflections. The optimizer will use Data Reflections for portions of the query when this produces the most efficient overall plan. In many cases the entire query can be serviced from Data Reflections as they can be orders of magnitude more efficient than processing queries in the underlying data source.

Query push-downs

Dremio is able to push down processing into relational and non-relational data sources. Non-relational data sources typically do not support SQL and have limited execution capabilities. A file system, for example, cannot apply predicates or aggregations. MongoDB, on the other hand, can apply predicates and aggregations, but does not support all joins. The Dremio optimizer understands the capabilities of each data source. When it is most efficient, Dremio will push as much of a query down to the underlying source as possible and perform the rest in its own distributed execution engine.
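
To make the planning trade-off concrete, here is an illustrative Python sketch. The source and collection names are hypothetical, and the aggregation pipeline is roughly what a planner could hand to MongoDB, not the exact MQL that Dremio generates.

# Illustrative sketch of push-down planning; not Dremio internals.

# SQL as the user writes it against Dremio (hypothetical mongo.yelp.business source):
sql = """
SELECT state, COUNT(*) AS business_count
FROM mongo.yelp.business
WHERE review_count > 100
GROUP BY state
"""

# The filter and aggregation can run inside MongoDB itself, for example as an
# aggregation pipeline along these lines:
pushed_down_pipeline = [
    {"$match": {"review_count": {"$gt": 100}}},
    {"$group": {"_id": "$state", "business_count": {"$sum": 1}}},
]

# Anything the source cannot evaluate (for example, a join to a table in another
# system) would instead run in Dremio's distributed execution engine.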

Offloading operational databases

Most operational databases are designed for write-optimized workloads. Furthermore, these deployments must address stringent SLAs, as any downtime or degraded performance can significantly impact the business. As a result, operational systems are frequently isolated from processing analytical queries. In these cases Dremio can execute analytical queries using Data Reflections, which provide the most efficient query processing possible while minimizing the impact on the operational system. Data Reflections are updated periodically based on policies that can be configured on a table-by-table basis.

Query execution phases

The life of a query includes the following phases:

  1. Client submits query to coordinator via ODBC/JDBC/REST
  2. Planning
    1. Coordinator parses query into Dremio’s universal relational model
    2. Coordinator considers available statistics on data sources to develop query plan, as well as functional abilities of the source
  3. Coordinator rewrites query plan to use
    1. the available Data Reflections, considering ordering, partitioning, and distribution of the Data Reflections, and
    2. the available capabilities of the data source
  4. Execution
    1. Executors read data into Arrow buffers from sources in parallel
    2. Executors execute the rewritten query plan
    3. One executor merges the results from one or more executors and streams the final results to the coordinator
  5. Client receives the results from the coordinator

Note that the data may come from Data Reflections or the underlying data source(s). When reading from a data source, the executor submits the native queries (e.g. MongoDB MQL, Elasticsearch Query DSL, Microsoft Transact-SQL) as determined by the optimizer in the planning phase.

All data operations are performed on the executor nodes, enabling the system to scale to many concurrent clients using only a few coordinator nodes.

Example query push-down

To illustrate how the Dremio data fabric fits into your data architecture, let’s take a closer look at running a SQL query on a source that doesn’t support SQL.

One of the more popular modern data sources is Elasticsearch. There is a lot to like about Elasticsearch, but in terms of analytics it doesn’t support SQL (including SQL joins). That means tools like Tableau and Excel can’t be used to analyze data from applications built on this data store. There is a visualization project called Kibana that is popular for Elasticsearch, but Kibana is designed for developers. It’s not really for business users.

Dremio makes it easy to analyze data in Elasticsearch with any SQL-based tool, including Tableau. Let’s take for example the following SQL query for Yelp business data, which is stored in JSON:

SELECT state, city, name, review_count
FROM elastic.yelp.business
WHERE
  state NOT IN ('TX','UT','NM','NJ') AND
  review_count > 100
ORDER BY review_count DESC, state, city
LIMIT 10

Dremio compiles the query into an expression that Elasticsearch can process:

{
  "from" : 0,
  "size" : 4000,
  "query" : {
    "bool" : {
      "must" : [ {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "TX",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "UT",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NM",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "bool" : {
          "must_not" : {
            "match" : {
              "state" : {
                "query" : "NJ",
                "type" : "boolean"
              }
            }
          }
        }
      }, {
        "range" : {
          "review_count" : {
            "from" : 100,
            "to" : null,
            "include_lower" : false,
            "include_upper" : true
          }
        }
      } ]
    }
  }
}

There’s really no limit to the SQL that can be executed on Elasticsearch or any supported data source with Dremio. Here is a slightly more complex example that involves a windowing expression:

SELECT
  city,
  name,
  bus_review_count,
  bus_avg_stars,
  city_avg_stars,
  all_avg_stars
FROM (
  SELECT
    city,
    name,
    bus_review_count,
    bus_avg_stars,
    AVG(bus_avg_stars) OVER (PARTITION BY city) AS city_avg_stars,
    AVG(bus_avg_stars) OVER () AS all_avg_stars,
    SUM(bus_review_count) OVER () AS total_reviews
  FROM (
    SELECT
      city,
      name,
      AVG(review.stars) AS bus_avg_stars,
      COUNT(review.review_id) AS bus_review_count
    FROM
      elastic.yelp.business AS business
      LEFT OUTER JOIN elastic.yelp.review AS review ON business.business_id = review.business_id
    GROUP BY
      city, name
  )
)
WHERE bus_review_count > 100
ORDER BY bus_avg_stars DESC, bus_review_count DESC

This query asks how top-rated businesses compare to other businesses in each city. It looks at the average review for each business with more than 100 reviews compared to the average for all businesses in the same city. To perform this query, data from two different datasets in Elasticsearch must be joined together, an action that Elasticsearch doesn’t support. Parts of the query are compiled into expressions Elasticsearch can process, and the rest of the query is evaluated in Dremio’s distributed SQL execution engine.

If we were to create a Data Reflection on one of these datasets, Dremio’s query planner would automatically rewrite the query to use the Data Reflection instead of performing this push-down operation. The user wouldn’t need to change their query or connect to a different physical resource. They would simply see lower latency, in some cases up to 1000x lower, depending on the source and the complexity of the query.

An open source, industry standard data platform

Analysis and data science are about iterative investigation and exploration of data. Regardless of the complexity and scale of today’s datasets, analysts need to make fast decisions and iterate, without waiting for IT to provide or prepare the data.

To deliver true self-sufficiency, a self-service data fabric must be able to deliver data faster than the underlying infrastructure can on its own. It must understand how to cache various representations of the data in analytically optimized formats and pick the right representation based on freshness expectations and performance requirements. And it must do all of this in a smart way, without relying on explicit knowledge management and sharing.

Data Reflections are a sophisticated way to cache representations of data from many sources, applying multiple techniques to optimize performance and resource consumption. Through Data Reflections, Dremio can transparently route any user’s query against any dataset, virtual or physical, to the most efficient representation available.

As the number and variety of data sources in your organization continue to grow, investing in and relying on a new tier in your data stack will become necessary. You will need a solution with an open source core that is built on industry-standard technologies. Dremio provides a powerful execution and persistence layer built upon Apache Arrow, Apache Calcite, and Apache Parquet, three key pillars for the next generation of data platforms.

Jacques Nadeau is the CTO and cofounder of Dremio. Jacques is also the founding PMC chair of the open source Apache Drill project, spearheading the project’s technology and community. Prior to Dremio, he was the architect and engineering manager for Drill and other distributed systems technologies at MapR. In addition, Jacques was CTO and cofounder of YapMap, an enterprise search startup, and held engineering leadership roles at Quigo (AOL), Offermatica (Adobe), and aQuantive (Microsoft).

New Tech Forum provides a venue to explore and discuss emerging enterprise technology in unprecedented depth and breadth. The selection is subjective, based on our pick of the technologies we believe to be important and of greatest interest to InfoWorld readers. InfoWorld does not accept marketing collateral for publication and reserves the right to edit all contributed content. Send all inquiries to newtechforum@infoworld.com.


Source: InfoWorld Big Data