Druid Cluster Setup

 

What is Druid?

Apache Druid is a real-time analytics database designed for fast slice-and-dice analytics (“OLAP” queries) on large data sets. Druid is most often used as a database for powering use cases where real-time ingest, fast query performance, and high uptime are important. As such, Druid is commonly used for powering GUIs of analytical applications, or as a backend for highly-concurrent APIs that need fast aggregations. Druid works best with event-oriented data.

Ideal Druid Setup Architecture

This section describes the Druid processes and the suggested Master/Query/Data server organisation, as shown in the architecture diagram above.

Processes and Servers:

Druid has several process types, briefly described below:

  • Coordinator processes manage data availability on the cluster.
  • Overlord processes control the assignment of data ingestion workloads.
  • Broker processes handle queries from external clients.
  • Router processes are optional processes that can route requests to Brokers, Coordinators, and Overlords.
  • Historical processes store queryable data.
  • MiddleManager processes are responsible for ingesting data.
  • Peon processes are task execution engines spawned by MiddleManagers. Each Peon runs a separate JVM and is responsible for executing a single task. Peons always run on the same host as the MiddleManager that spawned them.

Server types:

Druid processes can be deployed any way you like; we have deployed them across four server types.

  • Master with ZooKeeper: Runs the Coordinator and Overlord processes and manages data availability and ingestion. We also installed ZooKeeper on this master node.
  • Query Node: Runs the Broker and optional Router processes and handles queries from external clients.
  • Historical Data Node: Runs Historical processes and stores all queryable data.
  • Middle Manager Data Node: Runs MiddleManager processes and executes the ingestion workloads.

For more details on process and server organisation, please see Druid Processes and Servers.

Cluster Deployment Setup Steps:

Apache Druid is designed to be deployed as a scalable, fault-tolerant cluster. The information in the basic cluster tuning guide can help with your decision-making process and with sizing your configurations.

This simple cluster will feature:

  • A Master server to host the Coordinator and Overlord processes.
  • Two scalable, fault-tolerant Data servers running Historical and MiddleManager processes.
  • A query server, hosting the Druid Broker and Router processes.

In production, we recommend deploying multiple Master servers and multiple Query servers in a fault-tolerant configuration based on your specific fault-tolerance needs, but you can get started quickly with one Master and one Query server and add more servers later.

First, we created four EC2 CentOS 7 machines with the configurations mentioned below and performed the following common server hardening steps on all of them (a sketch of these commands follows the list):

  • Format the file system on the attached volume and mount it.
  • Add mount-point entries to the fstab file so the mount persists across reboots.
  • Install Java 8, perl, mysql-connector-java, and other basic required packages.
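
A rough sketch of these preparation steps on CentOS 7, assuming the attached EBS volume appears as /dev/xvdf and is mounted at /data (the device name and package list are illustrative; adjust them to your environment):

# Format the attached EBS volume and mount it at /data (assumes the device is /dev/xvdf)
sudo mkfs -t xfs /dev/xvdf
sudo mkdir -p /data
sudo mount /dev/xvdf /data
# Persist the mount across reboots
echo '/dev/xvdf  /data  xfs  defaults,nofail  0 2' | sudo tee -a /etc/fstab
# Install Java 8 and the other basic packages
sudo yum install -y java-1.8.0-openjdk-devel perl mysql-connector-java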

1. Master Server with ZooKeeper

The Coordinator and Overlord processes are responsible for handling the metadata and coordination needs of your cluster. They can be colocated together on the same server.

In the druid cluster setup, we deployed the equivalent of one AWS m5a.xlarge instance.

The hardware details:

  • 4 vCPUs
  • 16 GB RAM
  • EBS volume 100.00 GB Storage
  • Instance Name: druid-master

2. Data Server (Middle Manager)

MiddleManagers handle the ingestion workloads in your cluster, spawning Peon tasks to execute each ingestion job. These servers benefit greatly from CPU and RAM.

In this example, we deployed the equivalent of one AWS c5.4xlarge instance.

The hardware details:

  • 16 vCPUs
  • 32 GB RAM
  • EBS volume 100.00 GB Storage
  • Instance Name: druid-middle-Manager


3. Data Server (Historical)

Historicals store and serve all the queryable data in your cluster. These servers benefit greatly from CPU, RAM, and SSDs.

In this example, we deployed the equivalent of one AWS m5a.2xlarge instance.

The hardware details:

  • 8 vCPUs
  • 32 GB RAM
  • EBS volume 100.00 GB + 200.00 GB storage
  • Instance Name: druid-historical

4. Query Node Server

Druid Brokers accept queries and farm them out to the rest of the cluster. They also optionally maintain an in-memory query cache. These servers benefit greatly from CPU and RAM. You can consider co-locating any open source UIs or query libraries on the same server that the Broker is running on.

In this example, we deployed the equivalent of one AWS m5a.xlarge instance.

The hardware details:

  • 4 vCPUs
  • 16 GB RAM
  • EBS volume 100.00 GB storage
  • Instance Name: druid-query-node

Configure ZooKeeper connection:

We are running ZK on Master servers, so first update conf/zoo.cfg to reflect how you plan to run ZK. Then, you can start the Master server processes together with ZK using:

bin/start-cluster-master-with-zk-server

In our Druid cluster we run ZK on the master node only; the config file looks like this:

File Location: /data/apache-druid-0.17.0/conf/zk/zoo.cfg

# Server
#
tickTime=2000
dataDir=/data/zk
clientPort=2181
initLimit=5
syncLimit=2
# Autopurge
#
autopurge.snapRetainCount=5
autopurge.purgeInterval=1
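
After starting the Master server with bin/start-cluster-master-with-zk-server, a quick sanity check (purely illustrative) is to confirm that ZooKeeper is listening on its client port:

# Check that ZooKeeper is listening on the client port
ss -ltn | grep 2181
# Optional four-letter-word check; should print "imok" (may require whitelisting on newer ZK versions)
echo ruok | nc localhost 2181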

Open ports (in firewall)

We are using a VPN, so we opened all ports for the internal network only. Below are the main ports that need to be opened for communication (an example of opening them with firewalld follows the port lists).

Master Server

  • 1527 (Derby metadata store; not needed if you are using a separate metadata store like MySQL or PostgreSQL)
  • 2181 (ZooKeeper; not needed if you are using a separate ZooKeeper cluster)
  • 8081 (Coordinator)
  • 8090 (Overlord)

Data Server

  • 8083 (Historical)
  • 8091, 8100–8199 (Druid Middle Manager; you may need higher than port 8199 if you have a very high druid.worker.capacity)

Query Server

  • 8082 (Broker)
  • 8888 (Router, if used)
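
As an illustration, on CentOS 7 with firewalld the Master server ports could be opened like this (the same pattern applies on the Data and Query servers; with a VPN or security groups you would normally restrict access to the internal network rather than opening these publicly):

# Master server: ZooKeeper, Coordinator, and Overlord ports
sudo firewall-cmd --permanent --add-port=2181/tcp
sudo firewall-cmd --permanent --add-port=8081/tcp
sudo firewall-cmd --permanent --add-port=8090/tcp
sudo firewall-cmd --reload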

1. Start Master Server

Download and unpack the druid release archive. It’s best to do this on a single machine at first, since we will be editing the configurations and then copying the modified distribution out to all servers.

Download the release that suits your requirements onto the master server, do all the configuration on the master node, and once done copy the distribution to the other nodes.
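
For example, the 0.17.0 release used here can be fetched from the Apache archive (the URL is shown for illustration; substitute the link for the version you need):

# Download the Druid 0.17.0 binary release onto the master server
wget https://archive.apache.org/dist/druid/0.17.0/apache-druid-0.17.0-bin.tar.gz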

Extract Druid by running the following commands in your terminal:

mkdir /data
cd /data
tar -xzf apache-druid-0.17.0-bin.tar.gz
cd apache-druid-0.17.0

In the package, you should find:

  • LICENSE and NOTICE files
  • bin/* - scripts related to the single-machine quickstart
  • conf/druid/cluster/* - template configurations for a clustered setup
  • extensions/* - core Druid extensions
  • hadoop-dependencies/* - Druid Hadoop dependencies
  • lib/* - libraries and dependencies for core Druid
  • quickstart/* - files related to the single-machine quickstart

We’ll be editing the files in conf/druid/cluster/ in order to get things running. All the files shown below were configured on the master and then copied to the other nodes.

Configure metadata storage and deep storage:

Metadata storage

In conf/druid/cluster/_common/common.runtime.properties, replace "metadata.storage.*" with the IP address of the machine that you will use as your metadata store:

  • druid.metadata.storage.connector.connectURI
  • druid.metadata.storage.connector.host

The MySQL extension and PostgreSQL extension docs have instructions for extension configuration and initial database setup.
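
For MySQL, the initial setup boils down to creating a UTF-8 database and a user that Druid can connect as, roughly as sketched below (host, user, and password are placeholders; follow the MySQL extension docs for the authoritative steps):

# Create the metadata database and a Druid user on the MySQL/RDS instance (placeholder credentials)
mysql -h mysql-rds-url -u root -p <<'SQL'
CREATE DATABASE druid_metadata DEFAULT CHARACTER SET utf8mb4;
CREATE USER 'druid'@'%' IDENTIFIED BY 'changeme';
GRANT ALL PRIVILEGES ON druid_metadata.* TO 'druid'@'%';
SQL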

Deep storage

Druid relies on a distributed filesystem or large object (blob) store for data storage. The most commonly used deep storage implementations are S3 (popular for those on AWS) and HDFS (popular if you already have a Hadoop deployment).

We have used S3 as deep storage.

S3

In conf/druid/cluster/_common/common.runtime.properties,

  • Add “druid-s3-extensions” to druid.extensions.loadList.
  • Comment out the configurations for local storage under “Deep Storage” and “Indexing service logs”.
  • Uncomment and configure appropriate values in the “For S3” sections of “Deep Storage” and “Indexing service logs” as below:

After this, you should make the following changes:

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/_common/common.runtime.properties

druid.extensions.loadList=["druid-s3-extensions"]#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
druid.storage.type=s3
druid.storage.bucket=druid-bucket-name
druid.storage.baseKey=druid/segments
druid.s3.accessKey=XXXX
druid.s3.secretKey=XXXX
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=druid-bucket-name
druid.indexer.logs.s3Prefix=druid/indexing-logs
Please see the S3 extension documentation for more info.

Druid configuration on master server:

  • Now we need to edit five configuration files on the master server, as shown in the blocks below.
  • The path of each file is mentioned in its title.
  • Once all files are configured properly, we need to copy this Druid distribution and configuration folder (/data/apache-druid-0.17.0) to the query and data nodes (see the example after this list).
  • Once the configuration is ready on all servers, start the respective service on each node.
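
For example, the configured distribution can be pushed out with rsync (the host names below are illustrative); remember to set druid.host in each node's copy to that node's own IP afterwards:

# Copy the configured distribution from the master to the other nodes (illustrative host names)
for host in druid-query-node druid-historical druid-middle-manager; do
  rsync -az /data/apache-druid-0.17.0 ${host}:/data/
done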

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/_common/common.runtime.properties

#
# Licensed to the Apache Software Foundation (ASF) under one
# or more contributor license agreements. See the NOTICE file
# distributed with this work for additional information
# regarding copyright ownership. The ASF licenses this file
# to you under the Apache License, Version 2.0 (the
# "License"); you may not use this file except in compliance
# with the License. You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing,
# software distributed under the License is distributed on an
# "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY
# KIND, either express or implied. See the License for the
# specific language governing permissions and limitations
# under the License.
#
# Extensions specified in the load list will be loaded by Druid
# We are using S3 for deep storage - configured in the "Deep storage" section below
# We are using MySQL for the metadata store - configured in the "Metadata storage" section below
# If you specify `druid.extensions.loadList=[]`, Druid won't load any extension from file system.
# If you don't specify `druid.extensions.loadList`, Druid will load all the extensions under root extension directory.
# More info: https://druid.apache.org/docs/latest/operations/including-extensions.html
druid.extensions.loadList=["kafka-emitter", "druid-basic-security", "druid-google-extensions", "druid-protobuf-extensions","druid-lookups-cached-global", "mysql-metadata-storage", "druid-s3-extensions", "druid-hdfs-storage", "druid-kafka-indexing-service", "druid-datasketches", "druid-parquet-extensions", "druid-avro-extensions"]
# If you have a different version of Hadoop, place your Hadoop client jar files in your hadoop-dependencies directory
# and uncomment the line below to point to your directory.
#druid.extensions.hadoopDependenciesDir=/my/dir/hadoop-dependencies
#
# Hostname
#
# Host IP of the current server; on the query and data nodes, set this to that node's own IP.
druid.host=xxx.xx.xx.xx
#
# Logging
# Log all runtime properties on startup. Disable to avoid logging properties on startup:
druid.startup.logging.logProperties=true
#
# Zookeeper
# ZooKeeper host IP (same as the master node)
druid.zk.service.host=xxx.xx.xx.xx
#druid.zk.service.host=localhost
druid.zk.paths.base=/data/zk
#
# Metadata storage
# For Derby server on your Druid Coordinator (only viable in a cluster with a single Coordinator, no fail-over):
#druid.metadata.storage.type=derby
#druid.metadata.storage.connector.connectURI=jdbc:derby://localhost:1527/var/druid/metadata.db;create=true
#druid.metadata.storage.connector.host=localhost
#druid.metadata.storage.connector.port=1527
# For MySQL (make sure to include the MySQL JDBC driver on the classpath):
druid.metadata.storage.type=mysql
druid.metadata.storage.connector.connectURI=jdbc:mysql://mysql-rds-url:3306/druid_metadata
druid.metadata.storage.connector.user=xxxxxx
druid.metadata.storage.connector.password=xxxxxx
# For PostgreSQL:
#druid.metadata.storage.type=postgresql
#druid.metadata.storage.connector.connectURI=jdbc:postgresql://db.example.com:5432/druid
#druid.metadata.storage.connector.user=...
#druid.metadata.storage.connector.password=...
#
# Deep storage
# For local disk (only viable in a cluster if this is a network mount):
#druid.storage.type=local
#druid.storage.storageDirectory=var/druid/segments
# For HDFS:
#druid.storage.type=hdfs
#druid.storage.storageDirectory=/druid/segments
# For S3:
druid.storage.type=s3
druid.storage.bucket=druid-bucket-name
druid.storage.baseKey=druid/segments
druid.s3.accessKey=XXXX
druid.s3.secretKey=XXXX
#
# Indexing service logs
# For local disk (only viable in a cluster if this is a network mount):
#druid.indexer.logs.type=file
#druid.indexer.logs.directory=var/druid/indexing-logs
# For HDFS:
#druid.indexer.logs.type=hdfs
#druid.indexer.logs.directory=/druid/indexing-logs
# For S3:
druid.indexer.logs.type=s3
druid.indexer.logs.s3Bucket=druid-bucket-name
druid.indexer.logs.s3Prefix=druid/indexing-logs
#
# Service discovery
#
druid.selectors.indexing.serviceName=druid/overlord
druid.selectors.coordinator.serviceName=druid/coordinator
#
# Monitoring
#
druid.monitoring.monitors=["org.apache.druid.java.util.metrics.JvmMonitor"]
druid.emitter=noop
druid.emitter.logging.logLevel=info
# Storage type of double columns
# Omitting this will lead to doubles being indexed as floats at the storage layer
druid.indexing.doubleStorage=double
#
# Security
#
druid.server.hiddenProperties=["druid.s3.accessKey","druid.s3.secretKey","druid.metadata.storage.connector.password"]
#
# SQL
#
druid.sql.enable=true
#
# Lookups
#
druid.lookup.enableLookupSyncOnStartup=false
# Authenticator
druid.auth.authenticatorChain=["XxxxxxxxxxMetadataAuthenticator"]
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.type=basic
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.initialAdminPassword=XXXX
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.initialInternalClientPassword=XXXX
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.credentialsValidator.type=metadata
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.skipOnFailure=false
druid.auth.authenticator.XxxxxxxxxxMetadataAuthenticator.authorizerName=XxxxxxxxMetadataAuthorizer
# Escalator
druid.escalator.type=basic
druid.escalator.internalClientUsername=druid_system
druid.escalator.internalClientPassword=XXXX
druid.escalator.authorizerName=XxxxxxxxMetadataAuthorizer
# Authorizer
druid.auth.authorizers=["XxxxxxxxMetadataAuthorizer"]
druid.auth.authorizer.XxxxxxxxMetadataAuthorizer.type=basic

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/data/historical/runtime.properties

# Host IP of the historical server
druid.host=xxx.xx.xx.xx
druid.service=druid/historical
druid.plaintextPort=8083
# HTTP server threads
druid.server.http.numThreads=60
# Processing threads and buffers
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=2
druid.processing.numThreads=15
druid.processing.tmpDir=/data/druid/processing
# Segment storage
druid.segmentCache.locations=[{"path":"/data/druid/segment-cache","maxSize":300000000000}]
druid.server.maxSize=300000000000
# Query cache
druid.historical.cache.useCache=true
druid.historical.cache.populateCache=true
druid.cache.type=caffeine
druid.cache.sizeInBytes=256000000

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/data/middleManager/runtime.properties

# Host IP of the middle manager server
druid.host=xxx.xx.xx.xx
druid.service=druid/middleManager
druid.plaintextPort=8091
# Number of tasks per middleManager
druid.worker.capacity=4
# Task launch parameters
druid.indexer.runner.javaOpts=-server -Xms1g -Xmx1g -XX:MaxDirectMemorySize=1g -Duser.timezone=UTC -Dfile.encoding=UTF-8 -XX:+ExitOnOutOfMemoryError -Djava.util.logging.manager=org.apache.logging.log4j.jul.LogManager
druid.indexer.task.baseTaskDir=/data/druid/task
# HTTP server threads
druid.server.http.numThreads=60
# Processing threads and buffers on Peons
druid.indexer.fork.property.druid.processing.numMergeBuffers=2
druid.indexer.fork.property.druid.processing.buffer.sizeBytes=100000000
druid.indexer.fork.property.druid.processing.numThreads=1
# Hadoop indexing
druid.indexer.task.hadoopWorkingPath=/data/druid/hadoop-tmp

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/query/broker/runtime.properties

# Host IP of the query server
druid.host=xxx.xx.xx.xx
druid.service=druid/broker
druid.plaintextPort=8082
# HTTP server settings
druid.server.http.numThreads=60
# HTTP client settings
druid.broker.http.numConnections=50
druid.broker.http.maxQueuedBytes=10000000
# Processing threads and buffers
druid.processing.buffer.sizeBytes=500000000
druid.processing.numMergeBuffers=6
druid.processing.numThreads=1
druid.processing.tmpDir=/data/druid/processing
# Query cache disabled -- push down caching and merging instead
druid.broker.cache.useCache=false
druid.broker.cache.populateCache=false

File Location: /data/apache-druid-0.17.0/conf/druid/cluster/query/router/runtime.properties

# Host IP of the query server
druid.host=xxx.xx.xx.xx
druid.service=druid/router
druid.plaintextPort=8888
# HTTP proxy
druid.router.http.numConnections=50
druid.router.http.readTimeout=PT5M
druid.router.http.numMaxThreads=100
druid.server.http.numThreads=100
# Service discovery
druid.router.defaultBrokerServiceName=druid/broker
druid.router.coordinatorServiceName=druid/coordinator
# Management proxy to coordinator / overlord: required for unified web console.
druid.router.managementProxy.enabled=true
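
With the configuration in place, the Master server processes (Coordinator, Overlord, and the bundled ZooKeeper) can now be started from the distribution root using the script mentioned earlier:

cd /data/apache-druid-0.17.0
bin/start-cluster-master-with-zk-server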

2. Start Middle-Manager Data Server

Copy the Druid distribution and your edited configurations from the master server to your Middle Manager data server.

From the distribution root, run the following command to start the Data server:

/data/apache-druid-0.17.0/bin/start-cluster-data-server

3. Start Historical Data Server

Copy the Druid distribution and your edited configurations from the master server to your Historical data server.

From the distribution root, run the following command to start the Data server:

/data/apache-druid-0.17.0/bin/start-cluster-data-server

4. Start Query Server

Copy the Druid distribution and your edited configurations from the master server to your Query server.

From the distribution root, run the following command to start the Query server:

/data/apache-druid-0.17.0/bin/start-cluster-query-server
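
Once all services are up, each Druid process exposes a /status endpoint that works as a quick health check, for example (host names are placeholders; add -u admin:<password> if the basic authenticator rejects unauthenticated requests):

# Coordinator and Overlord on the master
curl http://druid-master-ip:8081/status
curl http://druid-master-ip:8090/status
# Broker on the query node
curl http://druid-query-node-ip:8082/status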

You can open the Druid web console at the query server's IP on port 8888 and log in with the credentials configured above:

http://druid-query-node-ip:8888/

Druid provides a rich set of APIs (via HTTP and JDBC) for loading, managing, and querying your data. You can also interact with Druid via the built-in console.

Load data

Ref: https://github.com/apache/druid

Load streaming and batch data using a point-and-click wizard to guide you through ingestion setup. Monitor one-off tasks and ingestion supervisors.

Manage the cluster

Ref: https://github.com/apache/druid

Manage your cluster with ease. Get a view of your datasources, segments, ingestion tasks, and services from one convenient location. All powered by SQL systems tables, allowing you to see the underlying query for each view.

Issue queries

Ref: https://github.com/apache/druid

Use the built-in query workbench to prototype Druid SQL and native queries or connect one of the many tools that help you make the most out of Druid.
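
As an example of the HTTP API, a Druid SQL query can be POSTed to the Router (or directly to the Broker), as sketched below (the datasource name and credentials are placeholders):

# Run a simple Druid SQL query through the Router's SQL endpoint
curl -u admin:XXXX -H 'Content-Type: application/json' \
  http://druid-query-node-ip:8888/druid/v2/sql/ \
  -d '{"query": "SELECT COUNT(*) FROM \"your_datasource\""}'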

Community

Community support is available on the druid-user mailing list, which is hosted at Google Groups.

Development discussions occur on dev@druid.apache.org, which you can subscribe to by emailing dev-subscribe@druid.apache.org.

Chat with Druid committers and users in real-time on the #druid channel in the Apache Slack team. Please use this invitation link to join the ASF Slack, and once joined, go into the #druid channel.


Please do suggest if any changes or corrections are required. You can reach me for any queries.

Creator :

Pushkar Joshi

If you like this blog and find it helpful then please do share and do not forget to clap.
