Atlas allows users to define a model for the metadata objects they want to manage. The model is composed of definitions called ‘types’. Instances of ‘types’, called ‘entities’, represent the actual metadata objects that are managed. The Type System is the component that allows users to define and manage types and entities. All metadata objects managed by Atlas out of the box (such as Hive tables) are modelled using types and represented as entities. To store new types of metadata in Atlas, one needs to understand the concepts of the type system component.
A ‘Type’ in Atlas is a definition of how a particular kind of metadata object is stored and accessed. A type represents one attribute or a collection of attributes that define the properties of the metadata object. Users with a development background will recognize the similarity of a type to a ‘Class’ definition in object-oriented programming languages, or a ‘table schema’ in relational databases.
An example of a type that comes natively defined with Atlas is a Hive table. A Hive table is defined with these attributes:
Name:         hive_table
TypeCategory: Entity
SuperTypes:   DataSet
Attributes:
    name:             string
    db:               hive_db
    owner:            string
    createTime:       date
    lastAccessTime:   date
    comment:          string
    retention:        int
    sd:               hive_storagedesc
    partitionKeys:    array<hive_column>
    aliases:          array<string>
    columns:          array<hive_column>
    parameters:       map<string,string>
    viewOriginalText: string
    viewExpandedText: string
    tableType:        string
    temporary:        boolean
The following points can be noted from the above example:
An ‘entity’ in Atlas is a specific value or instance of an Entity ‘type’ and thus represents a specific metadata object in the real world. Referring back to our analogy of Object Oriented Programming languages, an ‘instance’ is an ‘Object’ of a certain ‘Class’.
An example of an entity is a specific Hive table. Say Hive has a table called ‘customers’ in the ‘default’ database. This table will be an ‘entity’ in Atlas of type hive_table. By virtue of being an instance of an entity type, it will have a value for every attribute that is part of the Hive table ‘type’, such as:
guid:     "9ba387dd-fa76-429c-b791-ffc338d3c91f"
typeName: "hive_table"
status:   "ACTIVE"
values:
    name:             "customers"
    db:               { "guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc", "typeName": "hive_db" }
    owner:            "admin"
    createTime:       1490761686029
    updateTime:       1516298102877
    comment:          null
    retention:        0
    sd:               { "guid": "ff58025f-6854-4195-9f75-3a3058dd8dcf", "typeName": "hive_storagedesc" }
    partitionKeys:    null
    aliases:          null
    columns:          [ { "guid": "65e2204f-6a23-4130-934a-9679af6a211f", "typeName": "hive_column" },
                        { "guid": "d726de70-faca-46fb-9c99-cf04f6b579a6", "typeName": "hive_column" },
                        ... ]
    parameters:       { "transient_lastDdlTime": "1466403208" }
    viewOriginalText: null
    viewExpandedText: null
    tableType:        "MANAGED_TABLE"
    temporary:        false
The following points can be noted from the example above:
With this understanding of entities, we can now see the difference between the Entity and Struct metatypes. Entities and Structs both compose attributes of other types. However, instances of Entity types have an identity (a GUID value) and can be referenced from other entities (for example, a hive_db entity is referenced from a hive_table entity). Instances of Struct types do not have an identity of their own. The value of a Struct type is a collection of attributes that is ‘embedded’ inside the entity itself.
We already saw that attributes are defined inside metatypes like Entity, Struct, Classification and Relationship. But we simplistically referred to attributes as having a name and a metatype value. Attributes in Atlas have further properties that define more concepts related to the type system.
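The distinction between referenced entities and embedded structs can be illustrated with a small Python sketch. The classes `EntityInstance` and `StructInstance`, the in-memory `store`, and the struct type name `audit_info` are hypothetical names chosen for illustration; they mirror the description above and are not Atlas APIs.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class EntityInstance:
    """An instance with its own identity (a GUID), so other entities can reference it."""
    type_name: str
    values: dict
    guid: str = field(default_factory=lambda: str(uuid.uuid4()))

    def reference(self) -> dict:
        # Other entities hold this small reference, not a copy of the entity.
        return {"guid": self.guid, "typeName": self.type_name}


@dataclass
class StructInstance:
    """A value without identity; it exists only inside the entity that embeds it."""
    type_name: str
    values: dict


store = {}  # guid -> EntityInstance (hypothetical in-memory store)

db = EntityInstance("hive_db", {"name": "default"})
store[db.guid] = db

table = EntityInstance(
    "hive_table",
    {
        "name": "customers",
        "db": db.reference(),  # reference to another entity by GUID
        # Embedded struct value (hypothetical struct type, for illustration only).
        "audit": StructInstance("audit_info", {"checkedBy": "admin"}),
    },
)
store[table.guid] = table

# Following the reference leads back to the same hive_db entity.
resolved_db = store[table.values["db"]["guid"]]
```

In this sketch the struct can only be reached through the table that embeds it, while the database can be looked up on its own by GUID.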
An attribute has the following properties:
name:        string,
typeName:    string,
isOptional:  boolean,
isIndexable: boolean,
isUnique:    boolean,
cardinality: enum
The properties above have the following meanings:
Using the above, let us expand on the attribute definition of one of the attributes of the hive table below. Let us look at the attribute called ‘db’ which represents the database to which the hive table belongs:
db:
    "name":        "db",
    "typeName":    "hive_db",
    "isOptional":  false,
    "isIndexable": true,
    "isUnique":    false,
    "cardinality": "SINGLE"
Note the “isOptional=false” constraint: a table entity cannot be created without a db reference.
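A required-attribute check of this kind could look like the following Python sketch. The `AttributeDef` class and the `missing_required` helper are hypothetical; only the property names come from the attribute properties listed above, and the default values are assumptions.

```python
from dataclasses import dataclass


@dataclass
class AttributeDef:
    """Hypothetical mirror of the attribute properties described in the text."""
    name: str
    type_name: str
    is_optional: bool = True      # default chosen for this sketch only
    is_indexable: bool = False
    is_unique: bool = False
    cardinality: str = "SINGLE"


def missing_required(attribute_defs, values):
    """Return the names of non-optional attributes that have no value."""
    return [
        attr.name
        for attr in attribute_defs
        if not attr.is_optional and values.get(attr.name) is None
    ]


table_attributes = [
    AttributeDef("name", "string"),
    AttributeDef("db", "hive_db", is_optional=False, is_indexable=True),
]

# A table without a db reference fails the check; one with a reference passes.
without_db = missing_required(table_attributes, {"name": "customers"})
with_db = missing_required(
    table_attributes,
    {"name": "customers", "db": {"guid": "b42c6cfc-c1e7-42fd-a9e6-890e0adf33bc"}},
)
```

Here `without_db` lists `db` as missing, while `with_db` is empty.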
columns:
    "name":        "columns",
    "typeName":    "array<hive_column>",
    "isOptional":  true,
    "isIndexable": true,
    "isUnique":    false,
    "constraints": [ { "type": "ownedRef" } ]
Note the “ownedRef” constraint for columns. By doing this, we are indicating that the defined column entities should always be bound to the table entity they are defined with.
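One way to picture this binding is a delete that cascades from the owner to the entities it owns. The sketch below assumes that "bound to" means owned entities share the lifecycle of their owner. The `store` layout, the `OWNED_REFS` table, the short GUIDs, and `delete_with_owned` are all hypothetical, and the sketch removes records outright for simplicity rather than modelling how Atlas itself handles deletion.

```python
# guid -> entity record (hypothetical in-memory layout with shortened GUIDs)
store = {
    "db-1": {"typeName": "hive_db", "values": {"name": "default"}},
    "col-1": {"typeName": "hive_column", "values": {"name": "id"}},
    "col-2": {"typeName": "hive_column", "values": {"name": "email"}},
    "tbl-1": {
        "typeName": "hive_table",
        "values": {
            "name": "customers",
            "db": {"guid": "db-1", "typeName": "hive_db"},
            "columns": [
                {"guid": "col-1", "typeName": "hive_column"},
                {"guid": "col-2", "typeName": "hive_column"},
            ],
        },
    },
}

# Attributes that carry the ownedRef constraint, per type name.
OWNED_REFS = {"hive_table": ["columns"]}


def delete_with_owned(store, owned_refs, guid):
    """Remove an entity and, recursively, every entity it owns. Returns deleted GUIDs."""
    entity = store.pop(guid, None)
    if entity is None:
        return []
    deleted = [guid]
    for attr in owned_refs.get(entity["typeName"], []):
        value = entity["values"].get(attr) or []
        refs = value if isinstance(value, list) else [value]
        for ref in refs:
            deleted.extend(delete_with_owned(store, owned_refs, ref["guid"]))
    return deleted


deleted = delete_with_owned(store, OWNED_REFS, "tbl-1")
```

The columns are removed together with the table, while the database, which is merely referenced and not owned, remains in the store.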
From this description and these examples, you can see that attribute definitions can be used to make the Atlas system enforce specific modelling behavior (constraints, indexing, etc.).
Atlas comes with a few pre-defined system types. We saw one example (DataSet) in preceding sections. In this section we will see more of these types and understand their significance.
Referenceable: This type represents all entities that can be searched for using a unique attribute called qualifiedName.
Asset: This type extends Referenceable and adds attributes like name, description and owner. Name is a required attribute (isOptional=false), the others are optional.
The purpose of Referenceable and Asset is to provide modellers with a way to enforce consistency when defining and querying entities of their own types. Having this fixed set of attributes allows applications and user interfaces to make convention-based assumptions about which attributes they can expect of types by default.
Infrastructure: This type extends Asset and can typically be used as a common supertype for infrastructural metadata objects like clusters, hosts, etc.
DataSet: This type extends Referenceable. Conceptually, it can be used to represent a type that stores data. In Atlas, hive tables, hbase_tables, etc. are all types that extend DataSet. Types that extend DataSet can be expected to have a schema, in the sense that they have an attribute that defines the attributes of that dataset, for example the columns attribute of a hive_table. Entities of types that extend DataSet also participate in data transformation, and this transformation can be captured by Atlas via lineage (or provenance) graphs.
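The effect of inheriting from these types can be sketched in Python by collecting attributes along the supertype chain. The `TYPES` dictionary layout, the `all_attributes` helper, and the user-defined type `my_report` with its `pageCount` attribute are hypothetical; the attribute names of Referenceable and Asset are the ones mentioned above.

```python
# type name -> supertypes and the attributes the type declares itself
TYPES = {
    "Referenceable": {"superTypes": [], "attributes": ["qualifiedName"]},
    "Asset": {
        "superTypes": ["Referenceable"],
        "attributes": ["name", "description", "owner"],
    },
    # Hypothetical user-defined type, for illustration only.
    "my_report": {"superTypes": ["Asset"], "attributes": ["pageCount"]},
}


def all_attributes(type_name, types):
    """Return inherited attributes first, then the type's own, without duplicates."""
    result = []
    for parent in types[type_name]["superTypes"]:
        for attr in all_attributes(parent, types):
            if attr not in result:
                result.append(attr)
    for attr in types[type_name]["attributes"]:
        if attr not in result:
            result.append(attr)
    return result


report_attributes = all_attributes("my_report", TYPES)
```

Because `my_report` extends Asset, an application can rely on it having `qualifiedName`, `name`, `description` and `owner` without knowing anything else about the type.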
Process: This type extends Asset. Conceptually, it can be used to represent any data transformation operation. For example, an ETL process that transforms a hive table with raw data to another hive table that stores some aggregate can be a specific type that extends the Process type. A Process type has two specific attributes, inputs and outputs. Both inputs and outputs are arrays of DataSet entities. Thus an instance of a Process type can use these inputs and outputs to capture how the lineage of a DataSet evolves.
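How inputs and outputs capture lineage can be illustrated with a short Python sketch. The process and dataset names, the dictionary layout, and the `upstream` helper are hypothetical; the sketch only assumes that each process lists the datasets it reads and writes, as described above, and it is not how Atlas computes lineage internally.

```python
from collections import deque

# Each process records the datasets it reads (inputs) and writes (outputs).
processes = [
    {"name": "load_raw", "inputs": ["raw_events"], "outputs": ["clean_events"]},
    {"name": "aggregate", "inputs": ["clean_events"], "outputs": ["daily_summary"]},
]


def upstream(dataset, processes):
    """Return every dataset that the given dataset is derived from, nearest first."""
    # Index processes by the datasets they produce.
    producers = {}
    for process in processes:
        for output in process["outputs"]:
            producers.setdefault(output, []).append(process)

    seen = set()
    order = []
    queue = deque([dataset])
    while queue:
        current = queue.popleft()
        for process in producers.get(current, []):
            for source in process["inputs"]:
                if source not in seen:
                    seen.add(source)
                    order.append(source)
                    queue.append(source)
    return order


summary_lineage = upstream("daily_summary", processes)
```

Walking the producers breadth-first yields `clean_events` before `raw_events`, which matches the order in which the aggregate table was derived.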