Thanks to Jesus Centeno of Qlik for the post below about using Impala alongside Qlik Sense.
Cloudera and Qlik (a member of the Impala Accelerator Program) have extended the delivery of insights and value to business stakeholders from "small data" into the Big Data world, enabling users to combine Big Data and "small data" to yield actionable business insights.
In this post, you'll learn how Qlik Sense interacts with HDFS via a Cloudera enterprise data hub. This document can also be used as a quick guide on how to conduct a fast and easy-to-set-up evaluation of Qlik Sense and Cloudera.
Qlik's primary way to leverage CDH as a source is via ODBC. Qlik leverages the free Apache Hive ODBC connector and Impala ODBC connector, both of which have been certified with Qlik (see the Version Compatibility tab).
The Cloudera connectors enable Qlik Sense to access Hadoop data by passing SQL queries to the underlying Impala or Hive engines. Data can be extracted either as a batch load (using the Hive connector) or in real time via Qlik's Direct Discovery feature (using the Impala connector). Within the context of a Qlik Sense application, the connections to Cloudera become a source that can co-exist with other sources (relational databases, flat files, and so on) within the associative model of the Qlik Sense application in question.
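In Qlik Sense load-script terms, the two extraction modes look roughly like this (a minimal sketch; the connection names and the `orders` table are illustrative, not part of the original setup):

```
// Batch load: pull the data into Qlik's in-memory model via the Hive connector
LIB CONNECT TO 'Cloudera Hive';
Orders:
SQL SELECT order_id, customer_id, amount FROM default.orders;

// Real-time: leave the data at the source and query it on demand via Impala
LIB CONNECT TO 'Cloudera Impala';
DIRECT QUERY
    DIMENSION customer_id
    MEASURE amount
    FROM default.orders;
```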
Cloudera's QuickStart VM provides an easy way to get started with an evaluation of Qlik Sense and Cloudera. (The Impala ODBC connector will also be needed.) After logging into the VM, use the Cloudera Manager console to confirm that the services are all running, specifically hdfs1, impala1, hue, and mapreduce1.
Next, use Hue, the open source GUI for Hadoop (see the bookmark in the browser integrated with the VM), to load data into HDFS. For example, the Tables Option allows the creation of new databases and tables.
Once the data has been either identified or loaded, Qlik Sense can connect to Cloudera by creating an ODBC connection using the Impala ODBC driver. The IP address of the Cloudera environment will be needed. For example, for the Cloudera QuickStart VM, the IP address can be found using the shell command ifconfig.
Using the appropriate IP address and port number, an ODBC DSN that uses the Impala ODBC connector can be created to connect to Cloudera. Use the DSN to test the connectivity.
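Once the DSN tests successfully, it can be referenced from the Qlik Sense load script (the DSN name here is illustrative):

```
// Legacy-mode connection through an ODBC DSN pointing at Impala
ODBC CONNECT TO 'ClouderaImpalaDSN';

// In standard mode, a data connection created in the Qlik Sense hub
// is referenced instead:
// LIB CONNECT TO 'Cloudera Impala';
```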
As mentioned previously, Qlik Sense can extract data from Cloudera into an application in one of three ways: (a) in-memory as a batch load process, (b) via Direct Discovery as a real-time extraction, or (c) via Direct Discovery in hybrid mode, where part of the data model is loaded as a batch process and the rest is kept real-time. This article focuses on options (b) and (c).
When a business discovery application is created, Qlik Sense automatically creates and captures the logic used to query the data sources in question. When Direct Discovery is used, Qlik Sense determines which part of the data model resides in-memory and which part is real-time. Qlik Sense uses a specific syntax (DIRECT QUERY) for the constructs that are kept real-time. This syntax allows certain data elements not to be loaded into the data model while still keeping them available for queries from the user interface and for combined analysis with the in-memory elements.
When a Direct Discovery field is used in a chart, the corresponding SQL query runs on the external data source (Hadoop via Impala connector). Upon completion, the result of the query will be displayed in the chart. For example, a common approach is to keep fact tables as Direct Discovery tables and load the rest of the data model into memory (hybrid approach). This type of setup enables quick analysis for dimensional data for filtering and discovery purposes while keeping the fact data real-time.
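A hybrid data model of this kind can be sketched as follows (table and field names are illustrative; the dimension table is loaded in-memory while the fact table stays at the source):

```
LIB CONNECT TO 'Cloudera Impala';

// Dimension table: loaded into memory for fast filtering and discovery
Customers:
SQL SELECT customer_id, customer_name, region FROM default.customers;

// Fact table: kept real-time in Hadoop via Direct Discovery
DIRECT QUERY
    DIMENSION customer_id        // links to the in-memory Customers table
    MEASURE sales_amount
    DETAIL order_comment
    FROM default.orders;
```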
When selections are made in the application, the associated data values of the Direct Discovery fields will be used as a filter for all other queries. With each selection, the Direct Discovery charts will be recalculated. It is possible to use calculation conditions to set a rule indicating when the chart should be calculated. Until that condition is met, Qlik will not run queries and the chart in question will not be calculated.
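For example, a chart's calculation condition could require at least one selection before any Impala query is issued (a sketch; the field name is illustrative):

```
// Calculation condition on the chart:
// only calculate once at least one year has been selected
GetSelectedCount([OrderYear]) >= 1
```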
For the queries that represent the Direct Discovery part of the data model, Qlik automatically uses specific keywords. The Direct Discovery statement starts with the keyword DIRECT QUERY, followed by other keywords such as:
- The fields tagged as DIMENSION are loaded into memory as symbol tables so they can be used for quick selections and filtering. DIMENSION fields are considered part of the associative data model and are linked to other in-memory dimensions. When the application is loaded, Qlik loads only the unique values per field (the standard behavior for all in-memory fields as well), which allows Qlik Sense to use these columns to set up the associative links with the rest of the data model. If a DIMENSION field needs to be treated separately from the rest of the associative data model, the keyword DETACH should be used instead.
- The fields tagged as DETAIL exist only in the source data table within Hadoop and are not part of the in-memory data model. Qlik is aware of DETAIL fields at the metadata level, so they may be used in charts and visualizations, but the actual data of these fields resides only at the source. DETAIL fields are used for drilling purposes and are not involved in any chart expressions.
- The fields tagged as MEASURE should be used in a chart or visualization with an aggregation function (Sum, Count, Min, Max, Avg).
- The NATIVE fields are displayed at the lowest level without aggregation within a chart. In the case of relational databases, it is also possible to execute source-database SQL functions within the Direct Discovery table by using the NATIVE keyword.
Typically, these keywords are combined in a single Direct Discovery statement, which looks something like this (the table name and the CustomerID, Amount, and Comment fields are illustrative):

```
DIRECT QUERY
    DIMENSION CustomerID,
        NATIVE('month([OrderDate])') as OrderMonth,
        NATIVE('Year([OrderDate])') as OrderYear
    MEASURE Amount
    DETAIL [Comment]
    FROM orders;
```
Qlik Direct Discovery capability comes with multi-table support. In other words, a Direct Discovery table could be defined as the output of the join of multiple tables at the source. It is also possible to have more than one Direct Discovery table within the same application. Qlik Direct Discovery can be used against any SQL compliant data source such as:
- Hadoop (e.g., Cloudera) via ODBC connectors
- ODBC/OLEDB data sources such as commercial relational databases
- SQL-based custom connectors (SAP SQL Connector, Custom QVX connectors for SQL compliant data stores, etc.)
- Other SQL-based sources (SAP HANA/Parstream/HP Vertica)
Qlik Sense is very efficient at combining data from multiple data sources, so it can help organizations that need to complement data residing in Hadoop with data residing in other sources. While all users of the same application with Direct Discovery tables share the same connection, Qlik Sense provides a mechanism for row-level security rules at the user level.
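Row-level security in Qlik is typically expressed with Section Access; a minimal sketch (user names and the REGION reduction field are illustrative, and the field must also exist in the data model) could look like:

```
Section Access;
LOAD * INLINE [
    ACCESS, USERID,           REGION
    ADMIN,  DOMAIN\ADMINUSER, *
    USER,   DOMAIN\JDOE,      EAST
];
Section Application;
```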
Caching is used to improve the overall execution time and user experience of Direct Discovery. As the same types of selections are made by various users, Qlik Sense serves the query from the cache rather than querying the source data. These cached result sets are shared across users.
There are a number of parameters that can be set in a Qlik Sense application to control the exact behavior of Direct Discovery (how long cached results are stored in memory, the maximum number of parallel connections, the maximum number of rows displayed at once when drilling down to DETAIL fields, and so on).
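These parameters are set as variables in the load script before the Direct Discovery statement; for example (the values shown are illustrative):

```
SET DirectCacheSeconds = 3600;          // how long cached result sets are kept
SET DirectConnectionMax = 10;           // maximum number of parallel connections
SET DirectTableBoxListThreshold = 1000; // max rows shown when drilling to DETAIL fields
```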
In terms of Qlik Sense Direct Discovery performance, it is important to keep the following considerations in mind:
- Performance of Direct Discovery queries on Cloudera will be directly affected by network performance, cluster size, and memory available in data nodes.
- It is highly recommended to use an efficient metastore such as MySQL rather than Derby in order to improve the performance of concurrent Direct Discovery queries.
- Direct Discovery queries running through Impala can be accelerated by using the Apache Parquet file format.
As you should have gathered from the above, Qlik Sense and Cloudera work well together and integrate easily to provide business discovery capabilities against Big Data in a visual way. Your final result could look something like this: