The Hive model includes types such as hive_db, hive_table, hive_column and hive_process.
Hive entities are created and de-duplicated in Atlas using the unique attribute qualifiedName, whose value should be formatted as detailed below. Note that dbName, tableName and columnName should be in lower case.
hive_db.qualifiedName:     <dbName>@<clusterName>
hive_table.qualifiedName:  <dbName>.<tableName>@<clusterName>
hive_column.qualifiedName: <dbName>.<tableName>.<columnName>@<clusterName>
hive_process.queryString:  trimmed query string in lower case
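For illustration, the formatting rules above can be sketched as a few small helpers. These are hypothetical, purely illustrative functions, not part of Atlas:

```python
# Illustrative helpers (not Atlas source) showing how the qualifiedName
# values above are composed for Hive entities.

def hive_db_qualified_name(db, cluster):
    # dbName must be lower case; the cluster name follows '@'
    return f"{db.lower()}@{cluster}"

def hive_table_qualified_name(db, table, cluster):
    return f"{db.lower()}.{table.lower()}@{cluster}"

def hive_column_qualified_name(db, table, column, cluster):
    return f"{db.lower()}.{table.lower()}.{column.lower()}@{cluster}"

print(hive_table_qualified_name("Sales", "Orders", "primary"))
# sales.orders@primary
```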
The Atlas Hive hook registers with Hive to listen for create/update/delete operations and updates the metadata in Atlas, via Kafka notifications, for the changes in Hive. Follow the instructions below to set up the Atlas hook in Hive, by adding the following property in hive-site.xml:
<property>
  <name>hive.exec.post.hooks</name>
  <value>org.apache.atlas.hive.hook.HiveHook</value>
</property>
The following properties in atlas-application.properties control the thread pool and notification details:
atlas.hook.hive.synchronous=false                  # whether to run the hook synchronously; false recommended to avoid delays in Hive query completion. Default: false
atlas.hook.hive.numRetries=3                       # number of retries for notification failure. Default: 3
atlas.hook.hive.queueSize=10000                    # queue size for the threadpool. Default: 10000
atlas.cluster.name=primary                         # clusterName to use in qualifiedName of entities. Default: primary
atlas.kafka.zookeeper.connect=                     # Zookeeper connect URL for Kafka. Example: localhost:2181
atlas.kafka.zookeeper.connection.timeout.ms=30000  # Zookeeper connection timeout. Default: 30000
atlas.kafka.zookeeper.session.timeout.ms=60000     # Zookeeper session timeout. Default: 60000
atlas.kafka.zookeeper.sync.time.ms=20              # Zookeeper sync time. Default: 20
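A rough sketch, in Python for illustration only, of the behavior that atlas.hook.hive.numRetries governs: when the hook runs asynchronously, failed notification deliveries are retried a fixed number of times before being given up. All names here are hypothetical; this is not Atlas source code:

```python
NUM_RETRIES = 3  # mirrors atlas.hook.hive.numRetries

def deliver(notification, send, retries=NUM_RETRIES):
    """Attempt to send a notification, retrying on failure.

    Returns True on success, False once all retries are exhausted
    (analogous to the hook giving up after numRetries attempts)."""
    for _ in range(retries):
        try:
            send(notification)
            return True
        except RuntimeError:
            continue
    return False

# A sender that fails twice before succeeding on the third attempt
attempts = {"n": 0}
def flaky_send(msg):
    attempts["n"] += 1
    if attempts["n"] < 3:
        raise RuntimeError("broker unavailable")

print(deliver("ENTITY_CREATE", flaky_send))  # True: third attempt succeeds
```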
Other configurations for the Kafka notification producer can be specified by prefixing the configuration name with "atlas.kafka.". For the list of configurations supported by the Kafka producer, please refer to Kafka Producer Configs.
Starting from the 0.8-incubating version of Atlas, column-level lineage is captured in Atlas. The details are below.
For a simple CTAS below:
create table t2 as select id, name from T1
The lineage is captured as follows:
* The HiveHook maps the LineageInfo in the HookContext to column lineage instances
* The LineageInfo in Hive provides column-level lineage for the final FileSinkOperator, linking the output columns to the input columns of the Hive query
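As an illustration of the mapping above, the lineage records for the CTAS can be sketched as simple output-to-input column pairs. The structure and function name below are hypothetical, not the actual Atlas entity schema:

```python
def ctas_column_lineage(out_table, in_table, columns):
    # One lineage record per output column, linking it back to the
    # input column it was selected from (illustrative structure only)
    return [
        {"output": f"{out_table}.{c}", "inputs": [f"{in_table}.{c}"]}
        for c in columns
    ]

# Column lineage for: create table t2 as select id, name from T1
for rec in ctas_column_lineage("t2", "t1", ["id", "name"]):
    print(rec)
```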
Apache Atlas provides a command-line utility, import-hive.sh, to import metadata of Apache Hive databases and tables into Apache Atlas. This utility can be used to initialize Apache Atlas with the databases/tables present in Apache Hive. It supports importing metadata of a specific table, of tables in a specific database, or of all databases and tables.
Usage 1: <atlas package>/hook-bin/import-hive.sh
Usage 2: <atlas package>/hook-bin/import-hive.sh [-d <database regex> OR --database <database regex>] [-t <table regex> OR --table <table regex>]
Usage 3: <atlas package>/hook-bin/import-hive.sh [-f <filename>]

File Format:
database1:tbl1
database1:tbl2
database2:tbl1
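For reference, the -f file format above is simple enough to parse in a few lines. The helper below is purely illustrative and not part of the utility:

```python
def parse_import_file(text):
    """Parse 'database:table' pairs, one per line, skipping blank lines."""
    pairs = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        db, _, tbl = line.partition(":")
        pairs.append((db, tbl))
    return pairs

sample = "database1:tbl1\ndatabase1:tbl2\ndatabase2:tbl1"
print(parse_import_file(sample))
# [('database1', 'tbl1'), ('database1', 'tbl2'), ('database2', 'tbl1')]
```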