Sqoop Atlas Bridge

Sqoop Model

The default Sqoop model includes the following types:

  • Entity types:
    • sqoop_process
      • super-types: Process
      • attributes: name, operation, dbStore, hiveTable, commandlineOpts, startTime, endTime, userName
    • sqoop_dbdatastore
      • super-types: DataSet
      • attributes: name, dbStoreType, storeUse, storeUri, source, description, ownerName

  • Enum types:
    • sqoop_operation_type
      • values: IMPORT, EXPORT, EVAL
    • sqoop_dbstore_usage
      • values: TABLE, QUERY, PROCEDURE, OTHER

Entities are created and de-duplicated by their unique qualified name. The qualified name also provides a namespace and can be used for querying:

  • sqoop_process.qualifiedName - dbStoreType-storeUri-endTime
  • sqoop_dbdatastore.qualifiedName - dbStoreType-storeUri-source
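
As an illustration of the two formats above, the following minimal Python sketch joins the parts with hyphens. The function names and the sample values (a MySQL JDBC URI, a table called orders, an epoch-millisecond end time) are hypothetical and are not part of the Atlas API; the exact representation the hook uses for endTime is an assumption here.

```python
def sqoop_process_qualified_name(db_store_type: str, store_uri: str, end_time: int) -> str:
    """Compose dbStoreType-storeUri-endTime, the format listed for sqoop_process."""
    return f"{db_store_type}-{store_uri}-{end_time}"


def sqoop_dbdatastore_qualified_name(db_store_type: str, store_uri: str, source: str) -> str:
    """Compose dbStoreType-storeUri-source, the format listed for sqoop_dbdatastore."""
    return f"{db_store_type}-{store_uri}-{source}"


# Hypothetical sample values, for illustration only.
STORE_TYPE = "mysql"
STORE_URI = "jdbc:mysql://db.example.com:3306/sales"

process_name = sqoop_process_qualified_name(STORE_TYPE, STORE_URI, 1700000000000)
store_name = sqoop_dbdatastore_qualified_name(STORE_TYPE, STORE_URI, "orders")
```

Because the store name does not include a timestamp, repeated imports from the same source resolve to the same sqoop_dbdatastore entity, while each run yields a distinct sqoop_process name through its endTime.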

Sqoop Hook

Sqoop added a SqoopJobDataPublisher that publishes data to Atlas after an import job completes. Currently, SqoopHook supports only hiveImport. The hook uses the published data to add entities to Atlas according to the model detailed above.

Follow the instructions below to set up the Atlas hook in Sqoop:

  • Set up the Atlas hook in <sqoop-conf>/sqoop-site.xml by adding the following property:
   <property>
     <name>sqoop.job.data.publish.class</name>
     <value>org.apache.atlas.sqoop.hook.SqoopHook</value>
   </property>

  • Copy <atlas-conf>/atlas-application.properties to the Sqoop conf directory <sqoop-conf>/
  • Link <atlas-home>/hook/sqoop/*.jar into the Sqoop lib directory
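
The copy and link steps above can also be scripted. The following is a minimal Python sketch, not an official installer: the function name install_sqoop_hook and its four directory parameters are hypothetical, and it assumes the Sqoop lib directory is passed in explicitly and that symbolic links are acceptable on the target system. It does not edit sqoop-site.xml.

```python
import shutil
from pathlib import Path


def install_sqoop_hook(atlas_conf, atlas_home, sqoop_conf, sqoop_lib):
    """Copy the Atlas client config and symlink the hook jars (illustrative sketch)."""
    atlas_conf, atlas_home = Path(atlas_conf), Path(atlas_home)
    sqoop_conf, sqoop_lib = Path(sqoop_conf), Path(sqoop_lib)

    # Copy <atlas-conf>/atlas-application.properties into <sqoop-conf>/.
    shutil.copy2(atlas_conf / "atlas-application.properties", sqoop_conf)

    # Link every jar under <atlas-home>/hook/sqoop into the Sqoop lib directory.
    linked = []
    for jar in sorted((atlas_home / "hook" / "sqoop").glob("*.jar")):
        link = sqoop_lib / jar.name
        if link.is_symlink() or link.exists():
            link.unlink()  # replace a stale link or file left by an earlier install
        link.symlink_to(jar.resolve())
        linked.append(link)
    return linked


# Example call; the paths are placeholders to be adapted to the local layout:
# install_sqoop_hook("/opt/atlas/conf", "/opt/atlas", "/etc/sqoop/conf", "/usr/lib/sqoop/lib")
```

Linking rather than copying the jars means an Atlas upgrade that replaces the files under <atlas-home>/hook/sqoop is picked up by Sqoop as long as the jar file names stay the same.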

Refer to Configuration for notification-related configurations.

NOTES

  • Currently, the Sqoop hook captures only the following Sqoop operation: hiveImport