Advanced Search

Background

Advanced Search in Atlas is also referred to as DSL-based Search.

Domain Specific Search (DSL) is a language with simple constructs that help users navigate Atlas data repository. The syntax loosely emulates the popular Structured Query Language (SQL) from relation database world.

Benefits of DSL:

  • Abstracts the implementation-level database constructs. This avoids the necessity of knowing about the underlying graph database constructs.
  • User are provided with an abstraction that helps them retrieve the data by just being aware of the types and their relationships within their dataset.
  • Allows for a way to specify the desired output.
  • Use of classifications is accounted for in the syntax.
  • Provides way to group and aggregate results.

We will be using the quick start dataset in the examples that follow. This dataset is comprehensive enough to be used to to demonstrate the various features of the language.

For details on the grammar, please refer to Atlas DSL Grammer on Github (Antlr G4 format).

Using Advanced Search

Within the Atlas UI, select Advanced in the Search pane on the left.

Notice that the Favorite Searches pane below the Search By Query box. Like Basic Search, it is possible to save the Advanced Searches as well.

Introduction to Domain Specific Language

DSL uses the familiar SQL-like syntax.

At a high-level a query has a from-where-select format. Additional keywords like grouby, orderby, limit can be used to added to affect the output. We will see examples of these below.

From Clause

Specifying the from clause is mandatory. Using the from keyword itself is optional. The value specified in the from clause acts as the source or starting point for the rest of the query to source its inputs.

Example: To retrieve all entities of type DB:

DB
from DB

In the absence of where for filtering on the source, the dataset fetched by the from clause is everything from the database. Based on the size of the data present in the database, there is a potential to overwhelm the server. The query processor thus adds limit clause with a default value set. See the section on limit clause for details.

Where Clause

The where clause allows for filtering over the dataset. This achieved by using conditions within the where clause.

A conditions is identifier followed by an operator followed by a literal. Literal must be enclosed in single or double quotes. Example, name = "Sales". An identifier can be name of the property of the type specified in the from clause or an alias.

Example: To retrieve entity of type Table with a specific name say time_dim:

from Table where name = 'time_dim'

It is possible to specify multiple conditions by combining them using and, or operators.

Example: To retrieve entity of type Table with name that can be either time_dim or customer_dim:

from Table where name = 'time_dim' or name = 'customer_dim'

Filtering based on a list of values is done using by specifying the values in the square brackets. A value array is a list of values enclosed within square brackets. This is a simple way to specify an OR clause on an identifier.

Note that having several OR clauses on the same attribute may be inefficient. Alternate way is to use the value array as shown in the example below.

Example: The query in the example above can be written using a value array as shown below.

from Table where name = ["customer_dim", "time_dim"]

A condition that uses the LIKE operator, allows for filtering using wildcards like '*' or '?'.

Example: To retrieve entity of type Table whose name ends with '_dim':

from Table where name LIKE '*_dim'

Additional forms of regular expressions can also be used.

Example: To retrieve DB whose name starts with R followed by has any 3 characters, followed by rt followed by at least 1 character, followed by none or any number of characters.

DB where name like "R???rt?*"
Using Date Literals

Dates used in literals need to be specified using the ISO 8601 format.

Dates in this format follow this notation:

  • yyyy-MM-ddTHH:mm:ss.SSSZ. Which means, year-month-day followed by time in hour-minutes-seconds-milli-seconds. Date and time need to be separated by 'T'. It should end with 'Z'.
  • yyyy-MM-dd. Which means, year-month-day.

Example: Date represents December 11, 2017 at 2:35 AM.

2017-12-11T02:35:0.0Z

Example: To retrieve entity of type Table created within 2017 and 2018.

from Table where createTime < '2018-01-01' and createTime > '2017-01-01'
Using Boolean Literals

Properties of entities of type boolean can be used within queries.

Eample: To retrieve entity of type hdfs_path whose attribute isFile is set to true and whose name is Invoice.

from hdfs_path where isFile = true or name = "Invoice"

Valid values for boolean literals are 'true' and 'false'.

Existence of a Property

The has keyword can be used with or without the where clause. It is used to check existence of a property in an entity.

Example: To retreive entity of type Table with a property locationUri.

Table has locationUri
from Table where Table has locationUri

Select Clause

If you noticed the output displayed on the web page, it displays a tabular display, each row corresponding to an entity and columns are properties of that entity. The select clause allows for choosing the properties of entity that are of interest.

Example: To retrieve entity of type Table with few properties:

from Table select owner, name, qualifiedName

Example: To retrieve entity of type Table for a specific table with some properties.

from Table where name = 'customer_dim' select owner, name, qualifiedName

To display column headers that are more meaningful, aliases can be added using the 'as' clause.

Example: To display column headers as 'Owner', 'Name' and 'FullName'.

from Table select owner as Owner, name as Name, qualifiedName as FullName
Note About Select Clauses

Given the complexity involved in using select clauses, these are the few rules to remember when using select clauses:

  • Works with all immediate attributes.
  • Works with Immediate attributes and aggregation on immediate attributes.
  • Referred attributes cannot be mixed with immediate attributes.

Example: To retrieve entity of type Table with name 'Sales' and display 'name' and 'owner' attribute of the referred entity DB.

Table where name = 'abcd' select DB.name, DB.owner

Current implementation does not allow the following:

Table where name = 'abcd' select DB.name, Table.name

Classification-based Filtering

In order to retrieve entities based on classification, a query would use is or isa keywords.

Example: To retrieve all entities of type Table that are tagged with Dimension classification.

from Table isa Dimension

Since, from is optional and is (or isa) are equivalent, the following queries yield the same results:

Table is Dimension

The is and isa clauses can also be used in where condition like:

from Table where Table isa Dimension

To search for all entities having a particular classification, simply use the name of the classification.

Example: To retrieve all entities that have Dimension classification.

Dimension

Limit & Offset Clauses

Often a query yields large number of results. To limit the outcome of the query, the limit and offset clauses are used.

Example: To retrieve only the 5 entities from a result set.

Column limit 5

The offset clauses retrieves results after the offset value.

Example: To retrieve only 5 entities from the result set after skipping the first 10.

Column limit 5 offset 10

The limit and offset clauses are usually specified in conjunction.

If no limit clause is specified in the query, a limit clause with a default limit (usually 100) is added to the query. This prevents the query from inadvertently fetching large number of results.

The offset clause is useful for displaying results in a user interface where few results from the result set are showing and more results are fetched as the user advances to next page.

Ordering Results

The orderby clause allows for sorting of results. Results are sorted in ascending order by default. Only immediate attributes can be used within this clause.

Ordering can be changed by using:

  • ASC Sort in ascending order. This is the default. If no ordering is specified after the orderby clause.
  • DESC Sort in descending order. This needs to be explicitly specified after the orderby clause.

Example: To retrieve the entities of type Column that are sorted in ascending order using the name property.

from Column orderby name
from Column orderby name asc

Example: Same results as above except that they are sorted in descending order.

from Column orderby name desc

Aggregate Functions

Let's look at aggregate functions:

  • sum: Adds (sums up) a value of the property specified, within the result set.
  • min: Finds the minimum value of the property specified, within a result set.
  • max: Finds the maximum value of the property specified, within a result set.
  • count: Finds the number of items specified by the group by clause.

These work only on immediate attributes.

Other examples of these in the Grouping Results section.

The count Keyword

Shows the number of items in a result set.

Example: To know how may entities of a type Column.

Column select count()

Example: Same as above with alias.

Column select count() as Cols

Example: To find the number of tables in a database.

Table where db.name = "Reporting" select count()
The max Keyword

Using this keyword it is possible to retrieve the maximum value of a property for an entity.

Example: Get the most recently created value of the createTime property of the Table entity.

Table select max(createTime)
The min Keyword

Using this keyword it is possible to retrieve the minimum value of a property for an entity.

Example: Get the least recently created value of the createTime property of the Table entity.

Table select min(createTime)

Grouping Results

The groupby clause groups results within the result using specified property.

Example: To retrieve entity of type Table such that tables belonging to an owner are together (grouped by owner).

Table groupby(owner)

While groupby can work without select, if aggregate functions are used within select clause, using groupby clause becomes mandatory as aggregate functions operate on a group.

Example: To retrieve entity of type Table such we know the most recently created entity.

Table groupby(createTime) select owner, name, max(createTime)

Example: To retrieve entity of type Table such we know the oldest entity.

Table groupby(createTime) select owner, name, min(createTime)

Example: To know the number of entities owned by each owner.

Table groupby(owner) select owner, count()

Where Clause With Complex Types

In the discussion so far we looked at where clauses with primitive types. This section will look at using properties that are non-primitive types.

In this model, the DB is modeled such that it is aware of all the Table it contains. Table on the other hand is aware of existence of the DB but is not aware of all the other Table instances within the system. Each Table maintains reference of the DB it belongs to.

Similar structure exists within the hive data model.

Example: To retrieve all the instances of the Table belonging to a database named 'Sales':

Table where db.name = "Sales"

The entity Column is modeled in a similar way. Each Table entity has outward edges pointing to Column entity instances corresponding to each column within the table.

Example: To retrieve all the Column entities for a given Table.

Table where name = "time_dim" select columns

The propeties of each Column entity type are displayed.

Using System Attributes

Each type defined within Atlas gets few attributes by default. These attributes help with internal book keeping of the entities. All the system attributes are prefixed with '__' (double underscore). This helps in identifying them from other attributes.

Following are the system attributes:

  • __guid Each entity within Atlas is assigned a globally unique identifier (GUID for short).
  • __modifiedBy Name of the user who last modified the entity.
  • __createdBy Name of the user who created the entity.
  • __state Current state of the entity. Please see below for details.
  • __timestamp Timestamp (date represented as integer) of the entity at the time of creation.
  • __modificationTimestamp Timestamp (date represented as integer) of the entity at the time of last modification.
State of an Entity

Entity within Atlas can be in the following states:

  • ACTIVE This is the state of entities that when it is available and is used within the system. It can be retrieved by default by searches.
  • DELETED When an entity is deleted, it's state is marked as DELETED. Entity in this state does not show up in search results. Explicit request needs to be made to retrieve this entity.
Using System Attributes in Queries

Example: To retrieve all entities that are deleted.

Asset where __state = "DELETED"

Example: To retrieve entity GUIDs.

Table select __guid

Example: To retrieve several system attributes.

hive_db select __timestamp, __modificationTimestamp, __state, __createdBy

Advanced Search REST API

Relevant models for these operations:

The V2 API

Get Results using DSL Search
Example See Examples sections below.
URL api/atlas/v2/search/dsl
Method GET
URL Parameters query: Query conforming to DSL syntax.
typeName: Type name of the entity to be retrived.
classification: Classification associated with the type or query.
limit: Maximum number of items in the result set.
offset: Starting index of the item in the result set.
Data Parameters None
Success Response The JSON will correspond to AtlasSearchResult.
Error Response Errors that are handled within the system will be returned as AtlasBaseException.
Method Signature @GET
@Path("/dsl")
@Consumes(Servlets.JSON_MEDIA_TYPE)
@Produces(Servlets.JSON_MEDIA_TYPE)
Examples
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Table"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Column&classification=PII"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?typeName=Table&classification=Dimension&limit=10&offset=2"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?query=Table%20isa%20Dimension"
curl -X GET -u admin:admin -H "Content-Type: application/json" "http://localhost:21000/api/atlas/v2/search/dsl?query=Table%20isa%20Dimension&limit=5&offset=2"

Implementation Approach

The general approach followed in implementation of DSL within Atlas can be enumerated in following steps:

  • Parser parses the incoming query for syntax.
  • Abstract syntax tree is generated by for a query that is parsed successfully.
  • Syntax tree is 'walked' using visitor pattern.
  • Each 'visit' within the tree adds a step in the Gremlin pipeline.
  • When done, the generated script is executed using Gremlin Script Engine.
  • Results generated be the query, if any, are processed and packaged in AtlasSearchResult structure.

Differences Between Master and Earlier Versions

The following clauses are no longer supported:

  • path
  • loop

Resources