Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called ‘types’. Instances of ‘types’ called ‘entities’ represent the actual metadata objects that are managed. The Type System is a component that allows users to define and manage the types and entities. All metadata objects managed by Atlas out of the box (like Hive tables, for e.g.) are modelled using types and represented as entities. To store new types of metadata in Atlas, one needs to understand the concepts of the type system component.
A ‘Type’ in Atlas is a definition of how a particular type of metadata objects are stored and accessed. A type represents one or a collection of attributes that define the properties for the metadata object. Users with a development background will recognize the similarity of a type to a ‘Class’ definition of object oriented programming languages, or a ‘table schema’ of relational databases.
An example of a type that comes natively defined with Atlas is a Hive table. A Hive table is defined with these attributes:
Name: hive_table MetaType: Class SuperTypes: DataSet Attributes: name: String (name of the table) db: Database object of type hive_db owner: String createTime: Date lastAccessTime: Date comment: String retention: int sd: Storage Description object of type hive_storagedesc partitionKeys: Array of objects of type hive_column aliases: Array of strings columns: Array of objects of type hive_column parameters: Map of String keys to String values viewOriginalText: String viewExpandedText: String tableType: String temporary: Boolean
The following points can be noted from the above example:
An ‘entity’ in Atlas is a specific value or instance of a Class ‘type’ and thus represents a specific metadata object in the real world. Referring back to our analogy of Object Oriented Programming languages, an ‘instance’ is an ‘Object’ of a certain ‘Class’.
An example of an entity will be a specific Hive Table. Say Hive has a table called ‘customers’ in the ‘default’ database. This table will be an ‘entity’ in Atlas of type hive_table. By virtue of being an instance of a class type, it will have values for every attribute that are a part of the Hive table ‘type’, such as:
id: "9ba387dd-fa76-429c-b791-ffc338d3c91f" typeName: “hive_table” values: name: “customers” db: "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc" owner: “admin” createTime: "2016-06-20T06:13:28.000Z" lastAccessTime: "2016-06-20T06:13:28.000Z" comment: null retention: 0 sd: "ff58025f-6854-4195-9f75-3a3058dd8dcf" partitionKeys: null aliases: null columns: ["65e2204f-6a23-4130-934a-9679af6a211f", "d726de70-faca-46fb-9c99-cf04f6b579a6", ...] parameters: {"transient_lastDdlTime": "1466403208"} viewOriginalText: null viewExpandedText: null tableType: “MANAGED_TABLE” temporary: false
The following points can be noted from the example above:
With this idea on entities, we can now see the difference between Class and Struct metatypes. Classes and Structs both compose attributes of other types. However, entities of Class types have the Id attribute (with a GUID value) a nd can be referenced from other entities (like a hive_db entity is referenced from a hive_table entity). Instances of Struct types do not have an identity of their own. The value of a Struct type is a collection of attributes that are ‘embedded’ inside the entity itself.
We already saw that attributes are defined inside composite metatypes like Class and Struct. But we simplistically referred to attributes as having a name and a metatype value. However, attributes in Atlas have some more properties that define more concepts related to the type system.
An attribute has the following properties:
name: string, dataTypeName: string, isComposite: boolean, isIndexable: boolean, isUnique: boolean, multiplicity: enum, reverseAttributeName: string
The properties above have the following meanings:
Using the above, let us expand on the attribute definition of one of the attributes of the hive table below. Let us look at the attribute called ‘db’ which represents the database to which the hive table belongs:
db: "dataTypeName": "hive_db", "isComposite": false, "isIndexable": true, "isUnique": false, "multiplicity": "required", "name": "db", "reverseAttributeName": null
Note the “required” constraint on multiplicity. A table entity cannot be sent without a db reference.
columns: "dataTypeName": "array<hive_column>", "isComposite": true, "isIndexable": true, “isUnique": false, "multiplicity": "optional", "name": "columns", "reverseAttributeName": null
Note the “isComposite” true value for columns. By doing this, we are indicating that the defined column entities should always be bound to the table entity they are defined with.
From this description and examples, you will be able to realize that attribute definitions can be used to influence specific modelling behavior (constraints, indexing, etc) to be enforced by the Atlas system.
Atlas comes with a few pre-defined system types. We saw one example (DataSet) in the preceding sections. In this section we will see all these types and understand their significance.
Referenceable: This type represents all entities that can be searched for using a unique attribute called qualifiedName.
Asset: This type contains attributes like name, description and owner. Name is a required attribute (multiplicity = required), the others are optional. The purpose of Referenceable and Asset is to provide modellers with way to enforce consistency when defining and querying entities of their own types. Having these fixed set of attributes allows applications and User interfaces to make convention based assumptions about what attributes they can expect of types by default.
Infrastructure: This type extends Referenceable and Asset and typically can be used to be a common super type for infrastructural metadata objects like clusters, hosts etc.
DataSet: This type extends Referenceable and Asset. Conceptually, it can be used to represent an type that stores data. In Atlas, hive tables, Sqoop RDBMS tables etc are all types that extend from DataSet. Types that extend DataSet can be expected to have a Schema in the sense that they would have an attribute that defines attributes of that dataset. For e.g. the columns attribute in a hive_table. Also entities of types that extend DataSet participate in data transformation and this transformation can be captured by Atlas via lineage (or provenance) graphs.
Process: This type extends Referenceable and Asset. Conceptually, it can be used to represent any data transformation operation. For example, an ETL process that transforms a hive table with raw data to another hive table that stores some aggregate can be a specific type that extends the Process type. A Process type has two specific attributes, inputs and outputs. Both inputs and outputs are arrays of DataSet entities. Thus an instance of a Process type can use these inputs and outputs to capture how the lineage of a DataSet evolves.