Apache Atlas uses and interacts with a variety of systems to provide metadata management and data lineage to data administrators. By choosing and configuring these dependencies appropriately, it is possible to achieve a high degree of service availability with Atlas. This document describes the state of high availability support in Atlas, including its capabilities and current limitations, and also the configuration required for achieving this level of high availability.
The architecture page in the wiki gives an overview of the various components that make up Atlas. The options described below for the various components assume that context, so it is worth reviewing the architecture page before proceeding with this one.
Currently, the Atlas Web Service has the limitation that only one instance can be active at a time. In earlier releases of Atlas, a backup instance could be provisioned and kept available; however, a manual failover was required to make the backup instance active.
From this release, Atlas will support multiple instances of the Atlas Web service in an active/passive configuration with automated failover. This means that users can deploy and start multiple instances of the Atlas Web Service on different physical hosts at the same time. One of these instances will be automatically selected as an 'active' instance to service user requests. The others will automatically be deemed 'passive'. If the 'active' instance becomes unavailable either because it is deliberately stopped, or due to unexpected failures, one of the other instances will automatically be elected as an 'active' instance and start to service user requests.
An 'active' instance is the only instance that can respond to user requests correctly. It can create, delete, modify or respond to queries on metadata objects. A 'passive' instance will accept user requests, but will redirect them, via an HTTP redirect, to the currently known 'active' instance. Specifically, a passive instance will not itself respond to any queries on metadata objects. However, all instances (both active and passive) will respond to admin requests that return information about that instance.
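For example, the admin status REST endpoint (the same URL used by the health checks described later in this document) reports each instance's state; the host names below are placeholders:

# Query the status of each deployed instance (placeholder hosts)
curl http://host1.company.com:21000/api/atlas/admin/status
# response of the form {Status:ACTIVE}
curl http://host2.company.com:21000/api/atlas/admin/status
# response of the form {Status:PASSIVE}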
When configured in High Availability mode, users get the following operational benefits: uninterrupted access to the Atlas service during planned maintenance (for example, when an instance is deliberately stopped), and continued service in the event of unexpected failures of an instance.
In the following sub-sections, we describe the steps required to set up High Availability for the Atlas Web Service. We also describe how the deployment and clients can be designed to take advantage of this capability. Finally, we describe a few details of the underlying implementation.
The following pre-requisites must be met for setting up the High Availability feature.
To set up High Availability in Atlas, a few configuration options must be defined in the atlas-application.properties file. While the complete list of configuration items is defined in the Configuration Page, this section lists a few of the main options.
atlas.server.ids=id1,id2
atlas.server.address.id1=host1.company.com:21000
atlas.server.address.id2=host2.company.com:21000
atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181
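Putting these together, a minimal HA excerpt of atlas-application.properties for two instances might look like the following. The host names and ports are placeholders, and the atlas.server.ha.enabled flag is shown on the assumption that your Atlas version requires HA to be enabled explicitly; consult the Configuration Page for the authoritative list.

# Illustrative HA excerpt (placeholder hosts and ports)
atlas.server.ha.enabled=true
atlas.server.ids=id1,id2
atlas.server.address.id1=host1.company.com:21000
atlas.server.address.id2=host2.company.com:21000
atlas.server.ha.zookeeper.connect=zk1.company.com:2181,zk2.company.com:2181,zk3.company.com:2181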
To verify that High Availability is working, run the following script on each of the instances where Atlas Web Service is installed.
$ATLAS_HOME/bin/atlas_admin.py -status
This script prints the current high availability state of the instance it is run against, for example ACTIVE or PASSIVE.
Under normal operating circumstances, only one of these instances should print ACTIVE in response to the script, and the others should print PASSIVE.
The Atlas Web Service can be accessed in two ways:
Clients can take advantage of the High Availability feature in one of two ways.
The simplest solution to enable highly available access to Atlas is to install and configure an intermediate proxy that can transparently switch between backend services based on their status. One such proxy solution is HAProxy.
Here is an example HAProxy configuration. Note that it is provided for illustration only, not as a recommended production configuration; for that, please refer to the HAProxy documentation.
frontend atlas_fe
  bind *:41000
  default_backend atlas_be

backend atlas_be
  mode http
  option httpchk GET /api/atlas/admin/status
  http-check expect string ACTIVE
  balance roundrobin
  server host1_21000 host1:21000 check
  server host2_21000 host2:21000 check backup

listen atlas
  bind localhost:42000
The above configuration binds HAProxy to listen on port 41000 for incoming client connections. It then routes the connections to either of the hosts host1 or host2 depending on an HTTP status check. The status check is done using an HTTP GET on the REST URL /api/atlas/admin/status, and is deemed successful only if the HTTP response contains the string ACTIVE.
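Once such a proxy is in place, clients only need to know the proxy address rather than the individual instance addresses. For example (the proxy host name is a placeholder):

# The proxy forwards the request to whichever instance is currently ACTIVE
curl http://atlas-proxy.company.com:41000/api/atlas/admin/status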
If one does not want to set up and manage a separate proxy, then the other option for using the High Availability feature is to build a client application that is capable of detecting status and retrying operations. In such a setting, the client application can be launched with the URLs of all Atlas Web Service instances that form the ensemble. The client should then call the REST URL /api/atlas/admin/status on each of these to determine which is the active instance. The response from the active instance would be of the form {Status:ACTIVE}. Also, when the client faces any exceptions in the course of an operation, it should again determine which of the remaining URLs is active and retry the operation.
The AtlasClient class that ships with Atlas can be used as an example client library that implements the logic for working with an ensemble and selecting the right Active server instance.
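As a minimal sketch of this approach (not the AtlasClient implementation itself), a shell client could probe each configured URL and use the one that reports ACTIVE; the host names below are placeholders:

#!/bin/sh
# Probe each configured Atlas URL and remember the first one that reports ACTIVE.
# Host names below are placeholders.
ATLAS_URLS="http://host1.company.com:21000 http://host2.company.com:21000"
ACTIVE_URL=""
for url in $ATLAS_URLS; do
  status=$(curl -s "$url/api/atlas/admin/status")
  case "$status" in
    *ACTIVE*) ACTIVE_URL="$url"; break ;;
  esac
done
echo "Active Atlas instance: ${ACTIVE_URL:-none found}"
# Subsequent REST calls should go to $ACTIVE_URL; if a call later fails, the probe
# should be repeated, since the active instance may have changed. A real client,
# such as AtlasClient, also parses the JSON response rather than matching a substring.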
Utilities in Atlas, like quick_start.py and import-hive.sh, can be configured to run with multiple server URLs. When launched in this mode, the AtlasClient automatically selects and works with the current active instance. If a proxy is set up in between, then its address can be used when running quick_start.py or import-hive.sh.
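For example, assuming these utilities accept a comma-separated list of server URLs (check the usage output of your version for the exact form), an invocation might look like:

# Run quick_start.py against an HA ensemble (placeholder URLs); the embedded
# AtlasClient selects whichever instance is currently active.
$ATLAS_HOME/bin/quick_start.py http://host1.company.com:21000,http://host2.company.com:21000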
The Atlas High Availability work is tracked under the master JIRA ATLAS-510. The JIRAs filed under it have detailed information about how the High Availability feature has been implemented. At a high level the following points can be called out:
As described above, Atlas uses JanusGraph to store the metadata it manages. By default, Atlas uses a standalone HBase instance as the backing store for JanusGraph. To provide HA for the metadata store, we recommend that Atlas be configured to use a distributed HBase cluster as the backing store for JanusGraph, so that Atlas benefits from the HA guarantees HBase provides. In order to configure Atlas to use HBase in HA mode, do the following:
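The detailed steps are covered in the Configuration Page; as a minimal sketch, assuming current property names and placeholder hosts, the graph storage settings in atlas-application.properties would point at the distributed HBase cluster's ZooKeeper quorum:

# Use a distributed, HA-enabled HBase cluster as JanusGraph's storage backend
atlas.graph.storage.backend=hbase
# ZooKeeper quorum used by the HBase cluster (placeholder hosts)
atlas.graph.storage.hostname=zk1.company.com,zk2.company.com,zk3.company.com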
As described above, Atlas indexes metadata through JanusGraph to support full text search queries. In order to provide HA for the index store, we recommend that Atlas be configured to use Solr or Elasticsearch as the backing index store for JanusGraph.
In order to configure Atlas to use Solr in HA mode, do the following:
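The detailed steps are in the Configuration Page; as a sketch, assuming SolrCloud and current property names (the ZooKeeper hosts are placeholders), the index store settings would look like:

# Use a SolrCloud cluster as JanusGraph's index backend
atlas.graph.index.search.backend=solr
atlas.graph.index.search.solr.mode=cloud
# ZooKeeper ensemble used by the SolrCloud cluster (placeholder hosts)
atlas.graph.index.search.solr.zookeeper-url=zk1.company.com:2181,zk2.company.com:2181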
In order to configure Atlas to use Elasticsearch in HA mode, do the following:
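Similarly for Elasticsearch, a sketch of the relevant settings, assuming current property names and placeholder hosts:

# Use an Elasticsearch cluster as JanusGraph's index backend
atlas.graph.index.search.backend=elasticsearch
# Comma separated list of Elasticsearch hosts (placeholders)
atlas.graph.index.search.hostname=es1.company.com,es2.company.com
atlas.graph.index.search.elasticsearch.client-only=true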
Metadata notification events from the Hooks are sent to Atlas by writing them to a Kafka topic called ATLAS_HOOK. Similarly, events from Atlas to other integrating components, like Ranger, are written to a Kafka topic called ATLAS_ENTITIES. Since Kafka persists these messages, the events will not be lost even if the consumers are down when the events are sent. In addition, we recommend that Kafka also be set up for fault tolerance so that it has higher availability guarantees. In order to configure Atlas to use Kafka in HA mode, do the following:
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper <list of zookeeper host:port entries> --topic ATLAS_HOOK --replication-factor <numReplicas> --partitions 1
$KAFKA_HOME/bin/kafka-topics.sh --create --zookeeper <list of zookeeper host:port entries> --topic ATLAS_ENTITIES --replication-factor <numReplicas> --partitions 1
Here KAFKA_HOME points to the Kafka installation directory.
atlas.notification.embedded=false
atlas.kafka.zookeeper.connect=<comma separated list of servers forming Zookeeper quorum used by Kafka>
atlas.kafka.bootstrap.servers=<comma separated list of Kafka broker endpoints in host:port form> - Give at least 2 for redundancy.
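The replication settings of the topics created above can then be verified with the standard Kafka tooling, for example:

# Verify the replication factor of the Atlas notification topics
$KAFKA_HOME/bin/kafka-topics.sh --describe --zookeeper <list of zookeeper host:port entries> --topic ATLAS_HOOK
$KAFKA_HOME/bin/kafka-topics.sh --describe --zookeeper <list of zookeeper host:port entries> --topic ATLAS_ENTITIES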