New in CDH 5.3: Apache Sentry Integration with HDFS

Starting in CDH 5.3, Apache Sentry integration with HDFS saves admins a lot of work by centralizing access control permissions across components that utilize HDFS.

It’s been more than a year and a half since a couple of my colleagues here at Cloudera shipped the first version of Sentry (now Apache Sentry (incubating)). The project filled a huge security gap in the Apache Hadoop ecosystem by bringing truly secure and dependable fine-grained authorization to it, with out-of-the-box integration for Apache Hive. Since then the project has grown significantly, adding support for Impala, Search, and the Hue Sentry app, to name a few notable additions.

In order to provide a truly secure and centralized authorization mechanism, Sentry deployments have historically been set up so that all of Hive’s data and metadata are accessible only by HiveServer2, with every other user cut out. This has been a pain point for Sqoop users, because Sqoop does not use the HiveServer2 interface. Users with a Sentry-secured Hive deployment were therefore forced to split an import into two steps: a plain HDFS import followed by manually loading the data into Hive.
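For illustration, that two-step workaround looked roughly like the sketch below (the connection string, table, and staging path are hypothetical, and the LOAD DATA step had to be run by a user that HiveServer2 allowed to write into the warehouse):

# Step 1: import into a plain HDFS staging directory (no --hive-import)
[jarcec@sqoopsentry-1 ~]$ sqoop import \
    --connect jdbc:mysql://mysql.example.com/sqoop \
    --username sqoop --password sqoop \
    --table text \
    --target-dir /user/jarcec/staging/text

# Step 2: load the staged files into Hive through HiveServer2
# (assumes the target table already exists with matching columns)
0: jdbc:hive2://...> LOAD DATA INPATH '/user/jarcec/staging/text' INTO TABLE text;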

With the inclusion of HDFS ACLs and the integration of Sentry into the Hive metastore in CDH 5.1, users were able to improve this situation and get the direct Hive import working again. However, this approach required the administrator to manually configure HDFS ACLs to mirror the Sentry configuration, and to manually refresh them to keep the two systems in sync.
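That manual intervention was deployment-specific, but it was roughly along the lines of the following sketch (the group name and warehouse path are examples only):

# mirror an existing Sentry grant by giving the group write access
# to the corresponding warehouse subtree...
[admin@sqoopsentry-1 ~]$ hdfs dfs -setfacl -R -m group:jarcec:rwx /user/hive/warehouse/default.db

# ...and re-run or adjust the ACLs whenever the Sentry policies change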

One of the big features included in the recently released CDH 5.3 is Sentry integration with HDFS, which enables customers to easily share data between Hive, Impala, and all the other Hadoop components that interact with HDFS (MapReduce, Spark, Pig, Sqoop, and so on) while ensuring that user access permissions only need to be set once and are uniformly enforced.

The rest of this post focuses on using Sqoop together with this new Sentry feature. Sqoop data can now be imported into Hive without any additional administrator intervention. Because Sentry policies (which tables a user can select from and which tables they can insert into) are exposed directly in HDFS, Sqoop reuses the same policies that have been configured via GRANT/REVOKE statements or the Hue Sentry app and imports data into Hive without any trouble.

Configuration

In order for Sqoop to seamlessly import into a Sentry-secured Hive instance, the Hadoop administrator needs to follow a few configuration steps to enable the necessary features. First, your cluster needs to use the Sentry service as the backend for storing authorization metadata, rather than the older policy files.

If you are already using the Sentry service and GRANT/REVOKE statements, you can skip directly to step 3.

  1. Make sure that the Sentry service is running on your cluster. You should see it in the service list:

    [Screenshot: the Sentry service in the cluster’s service list]

  2. Confirm that Hive is configured to use this service as the backend for Sentry metadata:

    [Screenshot: Hive configured to use the Sentry service for authorization metadata]

  3. Finally, enable HDFS integration with Sentry (a quick way to verify the result is sketched after this list):

    [Screenshot: enabling the Sentry HDFS integration]
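
Once HDFS integration is enabled and the privileges have synchronized, you can spot-check that Sentry grants are being surfaced as HDFS ACLs. The exact entries depend on your roles and paths; a hypothetical check on a warehouse database directory might look roughly like this:

[admin@sqoopsentry-1 ~]$ hdfs dfs -getfacl /user/hive/warehouse/default.db
# file: /user/hive/warehouse/default.db
# owner: hive
# group: hive
user::rwx
group::---
group:etl_users:rwx    <- entry synthesized from a Sentry grant to that group's role
mask::rwx
other::--x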

Example Sqoop Import

Let’s assume that we have a user jarcec who needs to import data into a Hive database named default. User jarcec is part of a group that is also called jarcec; in real life the group name doesn’t have to match the username, and that is fine.

With an unsecured Hive installation, the Hadoop administrator would have to step in and grant user jarcec write privileges on the directory /user/hive/warehouse or one of its subdirectories. With Sentry and HDFS integration, the administrator no longer needs to get involved: Sqoop will reuse the same authorization policies that have been configured through Hive SQL or via the Sentry Hue application. Let’s assume that user bc is jarcec’s manager and already has the privileges needed to grant access to the default database.

  1. bc starts by invoking beeline and connecting to HiveServer2:

    [bc@sqoopsentry-1 ~]$ beeline

    1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> !connect jdbc:hive2://sqoopsentry-1.vpc.cloudera.com:10000/default;principal=hive/sqoopsentry-1.vpc.cloudera.com@ENT.CLOUDERA.COM

  2. If user jarcec is not yet part of any role, bc needs to create one for him:

    1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> CREATE ROLE jarcec_role;

    No rows affected (0.769 seconds)

  3. The new role jarcec_role then needs to be granted to jarcec’s group, jarcec:

    1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ROLE jarcec_role to GROUP jarcec;

    No rows affected (0.651 seconds)

  4. Finally, bc can grant access to the database default (or any other) to the role jarcec_role (a quick way to verify the grants is sketched right after this list):

    1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> GRANT ALL ON DATABASE default TO ROLE jarcec_role;

    No rows affected (0.16 seconds)
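
If you want to double-check what has been granted, the same beeline session can be used to inspect roles and privileges. A quick sanity check could look like this (statements as supported by Hive with the Sentry service; output omitted):

1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> SHOW ROLE GRANT GROUP jarcec;
1: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> SHOW GRANT ROLE jarcec_role;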

By executing the steps above, user jarcec has been given the privilege to perform any action (insert or select) on all objects inside the database default. That includes the ability to create new tables, insert data, or simply query existing tables. With these privileges, user jarcec can run the following Sqoop command just as he is used to:

[jarcec@sqoopsentry-1 ~]$ sqoop import --connect jdbc:mysql://mysql.ent.cloudera.com/sqoop --username sqoop --password sqoop --table text --hive-import

14/12/14 15:37:38 INFO sqoop.Sqoop: Running Sqoop version: 1.4.5-cdh5.3.0
...
14/12/14 15:38:58 INFO mapreduce.ImportJobBase: Transferred 249.7567 MB in 75.8448 seconds (3.293 MB/sec)
14/12/14 15:38:58 INFO mapreduce.ImportJobBase: Retrieved 1000000 records.
14/12/14 15:38:58 INFO manager.SqlManager: Executing SQL statement: SELECT t.* FROM `text` AS t LIMIT 1
14/12/14 15:38:58 INFO hive.HiveImport: Loading uploaded data into Hive
14/12/14 15:39:09 INFO hive.HiveImport: 14/12/14 15:39:09 WARN conf.HiveConf: DEPRECATED: Configuration property hive.metastore.local no longer has any effect. Make sure to provide a valid value for hive.metastore.uris if you are connecting to a remote metastore.
14/12/14 15:39:09 INFO hive.HiveImport:
14/12/14 15:39:09 INFO hive.HiveImport: Logging initialized using configuration in jar:file:/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.26/jars/hive-common-0.13.1-cdh5.3.0.jar!/hive-log4j.properties
14/12/14 15:39:12 INFO hive.HiveImport: OK
14/12/14 15:39:12 INFO hive.HiveImport: Time taken: 1.079 seconds
14/12/14 15:39:12 INFO hive.HiveImport: Loading data to table default.text
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00000
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00001
14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00002
14/12/14 15:39:13 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00003
14/12/14 15:39:13 INFO hive.HiveImport: Table default.text stats: [numFiles=4, numRows=0, totalSize=261888896, rawDataSize=0]
14/12/14 15:39:13 INFO hive.HiveImport: OK
14/12/14 15:39:13 INFO hive.HiveImport: Time taken: 0.719 seconds
14/12/14 15:39:13 INFO hive.HiveImport: Hive import complete.
14/12/14 15:39:13 INFO hive.HiveImport: Export directory is not empty, keeping it.

And jarcec can easily confirm in beeline that the data has indeed been imported into Hive:

0: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> show tables from default;

+------------+
|  tab_name  |
+------------+
| text       |
+------------+

1 row selected (0.177 seconds)

0: jdbc:hive2://sqoopsentry-1.vpc.cloudera.co> select count(*) from text;

+----------+
|   _c0    |
+----------+
| 1000000  |
+----------+

1 row selected (72.188 seconds)

If Hive is configured to inherit permissions, you might notice that Sqoop will print out several warnings similar to this one:

14/12/14 15:39:12 INFO hive.HiveImport: setfacl: Permission denied. user=jarcec is not the owner of inode=part-m-00000

As there is no need to inherit HDFS permissions when Sentry’s HDFS integration is enabled, you can safely ignore such messages.
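The warnings come from Hive’s permission-inheritance feature. If you would rather silence them entirely, that behavior is controlled by a Hive setting along the lines of the snippet below (property name as used in CDH 5.x Hive; verify it against your version before changing anything):

<!-- hive-site.xml: do not copy warehouse directory permissions/ACLs onto newly created subdirectories -->
<property>
  <name>hive.warehouse.subdir.inherit.perms</name>
  <value>false</value>
</property>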

That’s All, Folks!

The latest CDH release, version 5.3.0, brings a bunch of new features. All Sqoop users should particularly check out the Sentry integration with HDFS, as it enables simple and straightforward imports into Sentry-secured Hive deployments without the need to manually configure HDFS permissions. The same SQL interface that is used to grant access to various databases and tables now also determines who can import (or export) data into Hive!

Jarek “Jarcec” Cecho is an Engineering Manager at Cloudera, responsible for the Data Ingest team (see team blog). Jarcec is also a committer/PMC member for Apache Sqoop, Apache Flume, Apache MRUnit, Apache DataFu (incubating), and Apache Sentry (incubating). He is also the co-author of the Apache Sqoop Cookbook.
