1. Introduction

This is the reference documentation for Genie version 4.0.0-SNAPSHOT. Here you will find high level concept documentation as well as documentation about how to install and configure this Genie version.

For more information about Genie, see the website.
For documentation of the REST API for this version of Genie please see the API Guide.
For a demo of this version of Genie please see the Demo Guide.

2. 4.0.0-SNAPSHOT Release Notes

The following are the release notes for Genie 4.0.0-SNAPSHOT.

3. 4.0.0 Release Notes

3.1. Major differences between 3.x and 4.x

  • Distributed execution - jobs are no longer restricted to execute on a Genie instance host

    • All jobs are executed via agent: a standalone process which takes ownership of a job, executes it, and reports updates back to the server via gRPC

  • Job execution CLI - jobs can be launched outside of the Genie cluster via command-line interface

  • Resource selection criteria - allow for additional selection criteria beyond matching based on tags

  • Resource resolution - provide a more flexible and customizable way to select command and cluster for a job

    • Command can specify cluster criteria, removing the need for explicit 1:1 mapping between the two entities

    • Command selected based on job command criteria. Cluster selected based on the combination of job and command cluster criteria

    • Pluggable resolution logic for advanced usage (in the form of Groovy scripts)

  • Long-term job output archival - archive job outputs outside the cluster

    • Customizable filtering of archived files

    • Explicit tracking of archival status, surfaced via API

  • Job status change notifications

  • Explicit tracking of job archival status

  • Zookeeper-based cluster membership (for request routing) and leader election

  • Pagination of results in Java Genie client

  • Based on Spring Boot 2.3.x

4. 3.3.0 Release Notes

The following are the release notes for Genie 3.3.0.

  • Complete database schema and interaction code re-write for more normalization

    • Allows more insights into job and user behavior by breaking apart large JSON blobs and other denormalized fields

    • Improved cluster selection algorithm to speed up selection

    • Projections on tables improve data transfer speeds

    • Merge jobs tables to reduce duplicate data

    • Surrogate primary keys for improved join performance and space usage vs. String-based external unique ids

  • New fields added to jobs

    • grouping

      • A way to search for jobs related to each other. E.g., the name of an entire workflow in a job scheduler can be set in this field to provide a way to find all the jobs related to that workflow

      • Added to search API as optional field

    • groupingInstance

      • Building on grouping this provides a field for the unique instance of the grouping e.g. the run identifier of the workflow

      • Added to search API as optional field

  • New field(s) added to Job Request, Job, Cluster, Command, Application

    • metadata

      • Allows users to insert any additional metadata they wish to these resources. MUST be valid JSON.

      • Stored as a blob, so no search is available. Meant for use by higher-level systems to take the metadata and parse it themselves for use in building up business use cases (lineage, relationships, etc.) that the Genie data model doesn’t support natively

  • Switch to H2 for in memory database

  • Turn on Hibernate schema validation at boot

4.1. Upgrade Instructions

Flyway will upgrade the database schema for you. For performance reasons at large scale, job data is not copied over between versions by default. Data for applications, commands and clusters is copied so as not to interrupt operation. If you want to keep your old job data, it is preserved in tables named {tableName}_old, and for MySQL and PostgreSQL scripts exist to copy the job data over. You can execute these scripts on your database; they should be able to run while your application is active and copy the data over in the background.

If you run the data movement scripts they will remove the old tables. If you don’t, the tables will remain in your schema. The next major Genie release will remove these tables in its schema upgrade scripts if they still exist. Feel free to drop them yourself if they’re no longer needed.

4.2. Library Upgrades

  • Upgrade Spring Boot to 2.3.9.RELEASE

  • Upgrade to Spring Cloud Hoxton.SR10 for cloud dependency management

5. 3.2.0 Release Notes

The following are the release notes for Genie 3.2.0.

5.1. Upgrade Instructions

If upgrading from an existing 3.1.x installation, run the appropriate database upgrade script:

This must be done before deploying the 3.2.0 binary or Flyway will break. Going forward this will no longer be necessary: the Genie binary will package upgrade scripts and Flyway will apply them automatically.

Once the script is run you can deploy the 3.2.0 binary. Once successfully deployed, you should see a new table, schema_version, in your db schema. Do not delete or modify this table; it is used by Flyway to manage upgrades.

5.2. Features

  • Database improvements

    • Switch to Flyway for database upgrade management

  • Abstract internal eventing behind common interface

  • Bug fixes

5.3. Library Upgrades

  • Upgrade Spring Boot to 1.5.7.RELEASE

  • Upgrade to Spring Platform IO Brussels-SR5 for library dependency management

  • Upgrade to Spring Cloud Dalston.SR3 for cloud dependency management

5.5. Database Upgrades

  • Standardize database schemas for consistency

  • Switch to Flyway for database upgrade management

  • If using MySQL, version 5.6.3+ is now required due to needed properties. See Installation for details

6. 3.1.0 Release Notes

The following are the release notes for Genie 3.1.0.

6.1. Features

  • Spring Session support made more flexible

    • Now can support none (off), Redis, JDBC and HashMap as session data stores based on spring.session.store-type property

  • Actuator endpoints secured by default

    • Follows new Spring default

    • Turn off by setting management.security.enabled to false

  • Optional cluster load balancer via Admin supplied script

  • Add dependencies to the Cluster and Command entities

  • Add configurations to the JobRequest entity

6.2. Library Upgrades

  • Upgrade Spring Boot from 1.3.8.RELEASE to 1.5.4.RELEASE

  • Upgrade to Spring Platform IO Brussels-SR3 for library dependency management

  • Upgrade to Spring Cloud Dalston.SR2 for cloud dependency management

  • Removal of Spring Cloud Cluster

    • Spring Cloud Cluster was deprecated and the leadership election functionality previously leveraged by Genie was moved to Spring Integration Zookeeper. That library is now used.

  • Tomcat upgraded to 8.5 from 8.0

6.3. Property Changes

6.3.1. Added

Property Description Default Value

genie.jobs.clusters.loadBalancers.script.destination

The location on disk where the script source file should be stored after it is downloaded from genie.jobs.clusters.loadBalancers.script.source. The file will be given the same name.

file:///tmp/genie/loadbalancers/script/destination/

genie.jobs.clusters.loadBalancers.script.enabled

Whether the script based load balancer should be enabled for the system or not. See also: genie.jobs.clusters.loadBalancers.script.source See also: genie.jobs.clusters.loadBalancers.script.destination

false

genie.jobs.clusters.loadBalancers.script.order

The order in which the script load balancer should be evaluated. The lower this number, the sooner it is evaluated. 0 would be evaluated first if nothing else is also set to 0. Must be < 2147483647 (Integer.MAX_VALUE). If no value is set it will be given Integer.MAX_VALUE - 1 (the default).

2147483646

genie.jobs.clusters.loadBalancers.script.refreshRate

How frequently to refresh the load balancer script (in milliseconds)

300000

genie.jobs.clusters.loadBalancers.script.source

The location of the script the load balancer should load to evaluate which cluster to use for a job request

file:///tmp/genie/loadBalancers/script/source/loadBalance.js

genie.jobs.clusters.loadBalancers.script.timeout

The amount of time (in milliseconds) that the system will attempt to run the cluster load balancer script before it forces a timeout

5000

genie.tasks.databaseCleanup.batchSize

The number of jobs to delete from the database at a time. Genie will loop until all jobs older than the retention time are deleted.

10000

management.security.roles

The roles a user needs to have in order to access the Actuator endpoints

ADMIN

security.oauth2.resource.filter-order

The order in which the OAuth2 resource filter is placed within the Spring security chain

3

spring.data.redis.repositories.enabled

Whether Spring data repositories should attempt to be created for Redis

true

spring.session.store-type

The back end storage system for Spring to store HTTP session information. See Spring Boot Session for more information. With the current classpath only none, hash_map, redis and jdbc will work.

hash_map

6.3.2. Changed Default Value

Property Old Default New Default

genie.tasks.clusterChecker.healthIndicatorsToIgnore

memory,genie,discoveryComposite

memory,genieMemory,discoveryComposite

management.security.enabled

false

true

6.3.4. Renamed

Old Name New Name

multipart.max-file-size

spring.http.multipart.max-file-size

multipart.max-request-size

spring.http.multipart.max-request-size

spring.cloud.cluster.leader.enabled

genie.zookeeper.enabled

spring.cloud.cluster.zookeeper.connect

genie.zookeeper.connectionString

spring.cloud.cluster.zookeeper.namespace

genie.zookeeper.leader.path

spring.datasource.min-idle

spring.datasource.tomcat.min-idle

spring.datasource.max-idle

spring.datasource.tomcat.max-idle

spring.datasource.max-active

spring.datasource.tomcat.max-active

spring.datasource.validation-query

spring.datasource.tomcat.validation-query

spring.datasource.test-on-borrow

spring.datasource.tomcat.test-on-borrow

spring.datasource.test-on-connect

spring.datasource.tomcat.test-on-connect

spring.datasource.test-on-return

spring.datasource.tomcat.test-on-return

spring.datasource.test-while-idle

spring.datasource.tomcat.test-while-idle

spring.datasource.min-evictable-idle-time-millis

spring.datasource.tomcat.min-evictable-idle-time-millis

spring.datasource.time-between-eviction-run-millis

spring.datasource.tomcat.time-between-eviction-run-millis

spring.jpa.hibernate.naming-strategy

spring.jpa.hibernate.naming.strategy

6.4. Database Upgrades

  • Add cluster and command dependencies table

  • Rename MySQL and PostgreSQL schema files

  • Index 'name' column of Jobs table

  • Switch Job and JobRequest tables 'description' column to text

  • Switch Applications' table 'cluster_criterias' and 'command_criteria' columns to text

  • Increase the size of 'tags' column for applications, clusters, commands, jobs, job_requests

  • Switch JobRequest table 'dependencies' column to text

  • Add job request table configs column

  • Double the size of 'config' and 'dependencies' column for Application, Cluster, Command

7. Concepts

7.1. Job overview

This section provides a high-level overview of what a Genie job is. More details on the concepts covered here are provided in later sections.

7.1.1. API jobs vs agent CLI jobs

Genie provides two ways to execute a job:

  • Submit via API (delegating execution to the server)

  • Execute locally via agent CLI

API jobs are easy to submit with any REST client. For example, the user can just say "Run query Q with SparkSQL". The server takes care of all the details; the user can sit back, wait for the job to complete, and then retrieve the output via API.

A job request via API may look something like this:

{
  "command" : ["type:spark-sql", "version:2.1.1"],
  "cluster" : ["data:production"],
  "arguments" : "-e 'select count(*) from some_table'",
  "memory" : 1024,
  "user" : "jdoe"
}

The advantages of API jobs are:

  • The server takes care of execution end-to-end

  • Any REST client is sufficient to submit jobs, check their statuses and retrieve their results

Genie V4 and above provides a different way to execute jobs: the agent CLI.

The agent CLI can be used to launch jobs outside of the Genie cluster (e.g., on the user’s laptop) while still leveraging Genie for configuration, monitoring, audit, archival, etc.

An agent job launched via CLI may look something like this:

$ genie-agent exec \
  --command='type:spark-shell' \
  --cluster='data:test' \
  -- --shell-color=true

There are two primary reasons for choosing agent CLI over API job request:

  • Interactive jobs (such as Spark Shell or other REPL interfaces)

  • More control over the job environment (geographical location, special hardware, etc)

7.1.1.1. Job Lifecycle

Regardless of how a job is launched, it will leverage Genie to:

  • Resolve symbolic resource criteria to the concrete clusters/commands installed

  • Track job status

  • Access logs and outputs during and after execution

Every job goes through the following stages:

  • Unique job id is reserved upon job request

  • Job resources criteria are resolved to concrete resources (cluster/command pair)

  • Job setup downloads and configures all necessary libraries, binaries, and configurations

  • Job is launched and monitored

  • Job final status, statistics and outputs are preserved for later consumption and audit

API Job
Figure 1. Lifecycle of an API job
  1. User submits a job request via REST API

  2. Genie server creates a record, resolves resources (command, cluster) and fills-in other details

  3. Genie server launches an agent to execute the job (on the same host, or in a remote container)

  4. The agent connects back to the Genie cluster and retrieves the job specification. It also keeps the server informed of its progress.

  5. The agent sets up the job (downloads dependencies, runs setup scripts, …​), then launches it, all while regularly updating the server. After the job is done, the agent archives outputs and logs.

  6. While the job is running and after completion, the user can retrieve job status and download outputs and logs (and even kill the job).

CLI Job
Figure 2. Lifecycle of a CLI job
  1. The user launches a job via CLI (the job executes locally, wherever the agent is invoked)

  2. The agent connects to the Genie cluster and creates the job record, resolves resources (cluster, command). It also keeps the server informed of its progress.

  3. The agent sets up the job (downloads dependencies, runs setup scripts, …​), then launches it, all while regularly updating the server. After the job is done, the agent archives outputs and logs.

  4. While the job is running and after completion, the user can retrieve job status and download outputs and logs (and even kill the job).

7.1.1.2. Status Transitions

Possible statuses of a job:

  • RESERVED - The id of the job has been reserved, and the corresponding request persisted

  • RESOLVED - The job request has been successfully resolved into a concrete job specification

  • ACCEPTED - The job has been accepted by the system via the REST API

  • CLAIMED - The job has been claimed by an agent for immediate execution

  • INIT - Job has begun setup (e.g., download and unpacking of dependencies, etc.)

  • RUNNING - Job has launched and is now running

  • SUCCEEDED - The job process has completed and exited with code 0

  • KILLED - The job was killed (due to timeout, user request, etc.)

  • FAILED - The job failed (due to unsatisfiable criteria, errors during setup, non-zero exit code, etc.)

Transitions between these states are slightly different depending on the kind of job:

Table 1. State transitions for API-submitted jobs
Transition Event

null → RESERVED

A valid job request was received and saved (API jobs only: the attachments (if any) were successfully saved)

RESERVED → RESOLVED

The job request criteria were successfully resolved into a job specification

RESERVED → FAILED

The job request criteria could not be satisfied

RESOLVED → ACCEPTED

(API jobs only) The server is proceeding to launch an agent to execute this job

ACCEPTED → CLAIMED

(API jobs only) The server-launched agent claimed this job for execution

ACCEPTED → FAILED

(API jobs only) The server failed to launch an agent to execute the job

RESOLVED → CLAIMED

(CLI jobs only) The CLI-launched agent claimed this job for execution

RESOLVED → FAILED

No agent has claimed this job for execution, and the server marked the job failed

CLAIMED → INIT

The agent started job setup (download dependencies, etc.)

CLAIMED → FAILED

The agent that claimed this job stopped heartbeating, and the server marked the job failed

INIT → RUNNING

The job setup completed successfully and the job process was launched

INIT → FAILED

Job setup failed (missing dependency, setup script error) or the agent that claimed the job stopped heartbeating

RUNNING → SUCCEEDED

The job command sub-process completed with exit code 0

RUNNING → FAILED

The job command sub-process completed with a non-zero exit code

INIT → KILLED, RUNNING → KILLED

The job was killed (as requested by the user via API, or due to timeout or other limits exceeded)
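The transition rules above can be expressed as a simple lookup table. The following is an illustrative sketch of the table, not Genie's actual implementation:

```python
# Sketch of the legal job status transitions described in the table above.
# This mirrors the documentation; it is not Genie's internal representation.
ALLOWED_TRANSITIONS = {
    None: {"RESERVED"},
    "RESERVED": {"RESOLVED", "FAILED"},
    "RESOLVED": {"ACCEPTED", "CLAIMED", "FAILED"},  # ACCEPTED for API, CLAIMED for CLI
    "ACCEPTED": {"CLAIMED", "FAILED"},
    "CLAIMED": {"INIT", "FAILED"},
    "INIT": {"RUNNING", "FAILED", "KILLED"},
    "RUNNING": {"SUCCEEDED", "FAILED", "KILLED"},
}


def is_legal(current, nxt):
    """Return True if the table above allows moving from current to nxt."""
    return nxt in ALLOWED_TRANSITIONS.get(current, set())
```

Note that SUCCEEDED, KILLED and FAILED are terminal: they appear as targets only, never as sources.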

7.2. Data Model

The Genie 4 data model contains several modifications and additions to the Genie 3 data model to enable even more flexibility, modularity and meta data retention. This section will go over the purpose of the various resources available from the Genie API and how they interact.

7.2.1. Caveats

  • The specific resource fields are NOT defined in this document. These fields are available in the REST API documentation

7.2.2. Resources

The following sections describe the various resources available from the Genie REST APIs. You should reference the API Docs for how to manipulate these resources. These sections will focus on the high level purposes for each resource and how they rely and/or interact with other resources within the system.

7.2.2.1. Configuration Resources

The following resources (applications, commands and clusters) are considered configuration, or admin, resources. They’re generally set up by the Genie administrator and available to all users for use with their jobs.

7.2.2.1.1. Application Resource

An application resource represents pretty much what you’d expect. It is a reusable set of binaries, configuration files and setup files that can be used to install and configure (surprise!) an application. Generally these resources are necessary when an application isn’t already installed and on the PATH of the host running the job.

When a job is run, Genie will download all the dependencies, configuration files and setup files of each application and cluster and store them all in the job working directory. It will then execute the setup scripts in order to install those applications for that job. Genie is "dumb" as to the contents or purpose of any of these files, so the onus is on the administrators to create and test these packages.

Applications are very useful for decoupling application binaries from a Genie deployment. For example you could deploy a Genie cluster and change the version of Hadoop, Hive, Spark that Genie will download without actually re-deploying Genie. Applications can be combined together via a command. This will be explained more in the Command section.

The first entity to talk about is an application. Applications are linked to commands in order for binaries and configurations to be downloaded and installed at runtime. Within Netflix this is frequently used to deploy new clients without redeploying a Genie cluster.

At Netflix our applications frequently consist of a zip of all the binaries uploaded to S3 along with a setup file to unzip and configure environment variables for that application.

7.2.2.1.2. Command Resource

Command resources primarily represent what a user would enter at the command line if they wanted to run a process on a cluster, and what binaries (i.e., applications) they would need on their PATH to make that possible.

Commands can have configuration, setup and dependency files just like applications, but primarily they should have an ordered list of applications associated with them if necessary. For example, let’s take a typical scenario involving running Hive. To run Hive you generally need a few things:

  1. A cluster to run its processing on (we’ll get to that in the Cluster section)

  2. A hive-site.xml file which says what metastore to connect to amongst other settings

  3. Hive binaries

  4. Hadoop configurations and binaries

So a typical setup for Hive in Genie would be to have one, or many, Hive commands configured. Each command would have its own hive-site.xml pointing to a specific metastore (prod, test, etc). Alternatively, site-specific configuration can be associated to clusters and will be available to all commands executing against it. The command would depend on Hadoop and Hive applications already configured which would have all the default Hadoop and Hive binaries and configurations. All this would be combined in the job working directory in order to run Hive.

You can have any number of commands configured in the system. They should then be linked to the clusters they can execute on. Clusters are explained next.

7.2.2.1.3. Cluster

A cluster stores all the details of an execution cluster including connection information, properties, etc. Some cluster examples are Hadoop, Spark, Presto, etc. Every cluster can be linked to a set of commands that it can run.

Genie does not launch clusters for you. It merely stores metadata about the clusters you have running in your environment so that jobs using commands and applications can connect to them.

Once a cluster has been linked to commands your Genie instance is ready to start running jobs. The job resources are described in the following section. One important thing to note is that the list of commands linked to the cluster is a priority ordered list. That means if you have two pig commands available on your system for the same cluster the first one found in the list will be chosen provided all tags match. See How it Works for more details.

7.2.2.2. Tagging

An important concept is the tagging of resources. Genie relies heavily on tags for how the system discovers resources like clusters and commands for a job. Each of the core resources has a set of tags that can be associated with them. These tags can be set to whatever you want, but it is recommended to come up with some sort of consistent structure for your tags to make it easier for users to understand their purpose. For example, at Netflix we’ve adopted some of the following standards for our tags:

  • sched:{something}

    • This corresponds to any schedule that this resource (likely a cluster) is expected to adhere to

    • e.g. sched:sla or sched:adhoc

  • type:{something}

    • e.g. type:yarn or type:presto for a cluster or type:hive or type:spark for a command

  • ver:{something}

    • The specific version of said resource

    • e.g. two different Spark commands could have ver:1.6.1 vs ver:2.0.0

  • data:{something}

    • Used to classify the type of data a resource (usually a command) will access

    • e.g. data:prod or data:test
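Following such a scheme, a cluster registered with Genie might carry tags like the fragment below. The values and field layout here are illustrative only; see the REST API documentation for the exact payload shape:

```json
{
  "name": "PrestoTestCluster",
  "status": "UP",
  "tags": ["sched:test", "type:presto", "ver:0.149"]
}
```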

7.2.2.3. Job Resources

The following resources all relate to a user submitting and monitoring a given job. They are split up from the Genie 2 Job idea to provide better separation of concerns, as usually a user doesn’t care about certain things: what node a job ran on or its Linux process exit code, for example.

Users interact with these entities directly, though all but the initial job request are read-only in the sense that you can only get their current state back from Genie.

7.2.2.3.1. Job Request

This is the resource you use to kick off a job. It contains all the information the system needs to run a job. Optionally the REST APIs can take attachments. All attachments and file dependencies are downloaded into the root of the job’s working directory. The most important aspects are the command line arguments, the cluster criteria and the command criteria. These dictate which cluster, command and arguments will be used when the job is executed. See the How it Works section for more details.

7.2.2.3.2. Job

The job resource is created in the system after a Job Request is received. All the information a typical user would be interested in should be contained within this resource. It has links to the command, cluster and applications used to run the job as well as the meta information like status, start time, end time and others. See the REST API documentation for more details.

7.2.2.3.3. Job Execution

The job execution resource contains information about where a job was run and other information that may be more interesting to a system admin than a regular user. Frequently useful in debugging.

A job contains all the details of a job request and execution including any command line arguments. Based on the request parameters, a cluster and command combination is selected for execution. Job requests can also supply necessary files to Genie either as attachments or using the file dependencies field if they already exist in an accessible file system. As a job executes, its details are recorded in the job record within the Genie database.

7.2.2.3.4. Resource configuration vs. dependencies

Genie allows associating files with the resources above so that these files are retrieved and placed in the job execution directory as part of the setup. When creating an Application, a Cluster, a Command or a Job, it is possible to associate configs and/or dependencies. Configs are expected to be small configuration files (XML, JSON, YAML, …​), whereas dependencies are expected to be larger and possibly binary (Jars, executables, libraries, etc). Application, Cluster, and Command dependencies are deleted after job completion (unless Genie is configured to preserve them), to avoid storing and archiving them over and over. Configurations are preserved. Job configurations and dependencies are also preserved.

7.2.3. Wrap-up

This section was intended to provide insight into how the Genie data model is thought out and works together. It is meant to be very generic and support as many use cases as possible without modifications to the Genie codebase.

7.3. How it Works

This section is meant to provide context for how Genie can be configured with Clusters, Commands and Applications (see Data Model for details) and then how these work together in order to run a job on a Genie node.

7.3.1. Resource Configuration

This section describes how configuration of Genie works from an administrator point of view. This isn’t how to install and configure the Genie application itself. Rather it is how to configure the various resources involved in running a job.

7.3.1.1. Register Resources

All resources (clusters, commands, applications) should be registered with Genie before attempting to link them together. Any files these resources depend on should be uploaded somewhere Genie can access them (S3, web server, mounted disk, etc).

Tagging of the resources, particularly Clusters and Commands, is extremely important. Genie will use the tags in order to find a cluster/command combination to run a job. You should come up with a convenient tagging scheme for your organization. At Netflix we try to stick to a pattern for tag structures like {tag category}:{tag value}. For example type:yarn or data:prod. This allows the tags to have some context so that when users look at what resources are available they can find what to submit their jobs with so they are routed to the correct cluster/command combination.

7.3.1.2. Linking Resources

Once resources are registered they should be linked together. By linking we mean to represent relationships between the resources.

7.3.1.2.1. Command Cluster Criteria

Starting in Genie V4, commands and clusters are no longer explicitly linked.

Instead, the command provides criteria describing the clusters suitable to execute it.

For example, a spark-submit (v2.1.1) command may declare the following cluster criteria:

  • type:hadoop

  • spark-version:2.1

This section of the command definition says: "this command is suitable to run against Hadoop clusters running Spark 2.1".

Implicitly, this excludes clusters that are not hadoop (say, Presto clusters), or Hadoop clusters that don’t support Spark 2.1.

The command’s cluster criteria are combined with the job’s cluster criteria to further narrow the set of viable clusters.
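As a rough sketch, such criteria might appear in a command definition like the fragment below. The field names here are illustrative; consult the REST API documentation for the exact schema:

```json
{
  "name": "spark-submit",
  "version": "2.1.1",
  "clusterCriteria": [
    { "tags": ["type:hadoop", "spark-version:2.1"] }
  ]
}
```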

7.3.1.2.2. Applications for a Command

Linking application(s) to commands means that a command has a dependency on said application(s). The order in which the applications are added is important because Genie will set up the applications in that order. Meaning if one application depends on another (e.g. Spark depends on Hadoop being on the classpath for YARN mode), Hadoop should be ordered first. All applications must be successfully installed before Genie will start running the job.

7.3.2. Job Submission

The system admin has everything registered and linked together. Things could change but that’s mostly transparent to end users, who just want to run jobs. How does that work? This section walks through what happens at a high level when a job is submitted.

7.3.2.1. Cluster and command matching

In order to submit a job request there is some work a user will have to do up front. What kind of job are they running? What cluster do they want to run on? What command do they want to use? Do they care about certain details like version or just want the defaults? Once they determine the answers to these questions, they can decide how to tag their job request in the clusterCriterias and commandCriteria fields.

A general rule of thumb for these fields is to use the lowest common denominator of tags that accomplishes what the user requires. This allows the most flexibility for the job to be moved to different clusters or commands as need be. For example, if they want to run a Spark job and don’t really care about the version, it is better to just say "type:sparksubmit" (assuming this is the tagging structure at your organization) instead of that plus "ver:2.0.0". This way when versions 2.0.1 or 2.1.0 become available, the job moves along with the new default. Obviously if they do care about the version, they should set it or any other specific tag.

The clusterCriterias field is an array of ClusterCriteria objects. This is done to provide a fallback mechanism. If no match is found for the first ClusterCriteria and commandCriteria combination, Genie moves on to the second, and so on until all options are exhausted. This is handy when it is desirable to run a job on some cluster that is only up some of the time, while falling back to another cluster that is always available.
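As an illustration, a job request fragment using this fallback mechanism might look like the following (field names follow the v3 API naming used in this section; check the REST API documentation for your version):

```json
{
  "clusterCriterias": [
    { "tags": ["type:presto", "ver:0.150"] },
    { "tags": ["type:presto"] }
  ],
  "commandCriteria": ["type:presto"]
}
```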

Only clusters with status UP and commands with status ACTIVE will be considered during the selection process; all others are ignored.
7.3.2.1.1. Cluster matching example

Say the following 3 clusters exist, tagged as follows:

PrestoTestCluster: sched:test, type:presto, ver:0.149

HadoopProdCluster: sched:sla, type:yarn, ver:2.7.0, ver:2.7

HadoopTestCluster: sched:test, type:yarn, ver:2.7.1, ver:2.7

Criteria Match Reason

type:yarn, ver:2.7, sched:sla

HadoopProdCluster

HadoopProdCluster satisfies all criteria

type:yarn, ver:2.7

HadoopProdCluster or HadoopTestCluster

Two clusters satisfy the criteria; which one is chosen is unspecified

type:yarn, ver:2.7.1

HadoopTestCluster

HadoopTestCluster satisfies all criteria

type:presto, ver:0.150

-

No cluster matches all criteria

[type:presto, ver:0.150], [type:presto]

PrestoTestCluster

The first criterion does not match any cluster, so Genie falls back to the second, less restrictive criterion ("any presto cluster").
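The matching behavior illustrated above (a criterion matches a cluster when the criterion's tags are a subset of the cluster's tags, with in-order fallback across criteria) can be sketched as follows. This is an illustration of the rules in this section, not Genie's actual selection code:

```python
# The three example clusters above, each mapped to its set of tags.
CLUSTERS = {
    "PrestoTestCluster": {"sched:test", "type:presto", "ver:0.149"},
    "HadoopProdCluster": {"sched:sla", "type:yarn", "ver:2.7.0", "ver:2.7"},
    "HadoopTestCluster": {"sched:test", "type:yarn", "ver:2.7.1", "ver:2.7"},
}


def match(criteria_list):
    """Return the clusters matched by the first criterion that matches any.

    A cluster matches a criterion when every tag in the criterion is
    present in the cluster's tag set. Criteria are tried in order,
    providing the fallback behavior described above.
    """
    for criterion in criteria_list:
        hits = [name for name, tags in CLUSTERS.items() if set(criterion) <= tags]
        if hits:
            return hits
    return []
```

When several clusters match the same criterion, which one Genie picks is left to the configured load balancer; the sketch simply returns all of them.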

7.3.2.2. User Submits a Job Request

There are other things a user needs to consider when submitting a job. All dependencies that aren't sent as attachments must already be uploaded somewhere Genie can access them, such as S3, a web server, or a shared disk.

Users should familiarize themselves with what the executable for their desired command includes. It's possible the system admin has set up some default parameters they should know about, so as to avoid duplication or unexpected behavior. They should also make sure they know all the environment variables that may be available to them as part of the cluster, command, and application setup processes.

7.3.2.3. Genie Receives the Job Request

When Genie receives the job request it does a few things immediately:

  1. If the job request doesn’t have an id it creates a GUID for the job

  2. It saves the job request to the database so it is recorded

    1. If the ID is in use a 409 will be returned to the user saying there is a conflict

  3. It creates job and job execution records in the database for consistency

  4. It saves any attachments in a temporary location
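The first two steps above can be sketched as follows (names like `receive_job_request` and the in-memory store are hypothetical stand-ins, not the real server code):

```python
# Sketch of initial request handling: assign an id if one is missing,
# then persist, returning 409 on an id conflict. Hypothetical names;
# the dict stands in for the database table.
import uuid

_saved_requests = {}  # stand-in for the job request table

def receive_job_request(request: dict) -> int:
    """Return an HTTP-style status code for the save attempt."""
    job_id = request.get("id") or str(uuid.uuid4())  # create a GUID if absent
    if job_id in _saved_requests:
        return 409  # id already in use: conflict
    request["id"] = job_id
    _saved_requests[job_id] = request
    return 202  # recorded; job will be processed asynchronously
```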

Next Genie will attempt to find a cluster and command matching the requested tag combinations. If none is found, it will send a failure back to the user and mark the job as failed in the database.

If a combination is found, Genie will then attempt to determine whether the node can run the job. That is, it checks the amount of client memory the job requires against the available memory in the Genie allocation. If there is enough, the job is accepted, will run on this node, and the job's memory is subtracted from the available pool. If not, the job is rejected with a 503 error message and the user should retry later.

The amount of memory used by a job is not strictly enforced or even monitored. The amount is determined as follows:

  1. Account for the amount requested in the job request (which must be below an admin-defined threshold)

  2. If not provided in the request, use the number provided by the admins for the given command

  3. If not provided in the command, use a global default set by the admins

Successful job submission results in a 202 response to the user stating the job is accepted and will be processed asynchronously by the system.
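The memory resolution fallback and the acceptance check described above might look like the following sketch (the constant names and values are invented for illustration):

```python
# Sketch of job memory resolution (request -> command -> global default)
# and the node acceptance check. Names and numbers are illustrative.

JOB_MAX_MEMORY = 10_240      # admin-defined per-job ceiling (MB), assumed value
DEFAULT_JOB_MEMORY = 1_024   # admin-defined global default (MB), assumed value

def resolve_job_memory(request_memory, command_memory):
    """Apply the three-step fallback, enforcing the admin threshold."""
    memory = request_memory or command_memory or DEFAULT_JOB_MEMORY
    if memory > JOB_MAX_MEMORY:
        raise ValueError(f"requested {memory}MB exceeds limit {JOB_MAX_MEMORY}MB")
    return memory

def try_accept(job_memory, available_memory):
    """Return (accepted, remaining available memory in the pool)."""
    if job_memory > available_memory:
        return False, available_memory  # would answer 503: retry later
    return True, available_memory - job_memory

print(resolve_job_memory(None, 1536))  # request didn't specify: command value wins → 1536
```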

7.3.2.4. Genie Performs Job Setup

Once a job has been accepted to run on a Genie node, a workflow is executed in order to set up the job working directory and launch the job. Some minor steps are left out for brevity.

  1. Job is marked in INIT state in the database

  2. A job directory is created under the admin configured jobs directory with a name of the job id

  3. A run script file is created with the name run under the job working directory

    1. Currently this is a bash script

  4. Kill handlers are added to the run script

  5. Directories for Genie logs, application files, command files, cluster files are created under the job working directory

  6. Default environment variables are added to the run script to export their values

  7. Cluster configuration files are downloaded and stored in the job work directory

  8. Cluster related variables are written into the run script

  9. Application configuration and dependency files are downloaded and stored in the job directory if any applications are needed

  10. Application related variables are written into the run script

  11. Cluster configuration and dependency files are downloaded and stored in the job directory

  12. Command configuration and dependency files are downloaded and stored in the job directory

  13. Command related variables are written into the run script

  14. All job dependency files (including configurations, dependencies, attachments) are downloaded into the job working directory

  15. Job related variables are written into the run script
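A rough sketch of the skeleton created by steps 2, 3, and 5 above follows. The subdirectory names and run-script contents here are illustrative assumptions, not Genie's exact layout:

```python
# Sketch of the job working directory skeleton (steps 2, 3 and 5 above).
# Directory names and script contents are assumptions for illustration.
import os, stat, tempfile

def setup_job_directory(jobs_dir: str, job_id: str) -> str:
    job_dir = os.path.join(jobs_dir, job_id)          # step 2: dir named by job id
    for sub in ("genie/logs", "genie/applications",   # step 5: log/app/command/
                "genie/command", "genie/cluster"):    #         cluster directories
        os.makedirs(os.path.join(job_dir, sub))
    run_script = os.path.join(job_dir, "run")         # step 3: bash run script
    with open(run_script, "w") as f:
        f.write("#!/bin/bash\n# kill handlers, env exports, etc. appended here\n")
    os.chmod(run_script, os.stat(run_script).st_mode | stat.S_IXUSR)
    return job_dir

jobs_root = tempfile.mkdtemp()
job_dir = setup_job_directory(jobs_root, "SP.CS.FCT_TICKET_0054500815")
print(sorted(os.listdir(job_dir)))  # → ['genie', 'run']
```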

7.3.2.5. Genie Launches and Monitors the Job Execution

Assuming no errors occurred during the setup, the job is launched.

  1. The job run script is executed in a forked process

  2. The script's PID is stored in the job_executions database table and the job is marked as RUNNING in the database

  3. A monitoring process is created for the PID

Once the job is running, Genie will poll the PID periodically, waiting for it to no longer exist.

An assumption is made about the amount of process churn on the Genie node. We're aware PIDs can be reused, but this reasonably shouldn't happen within the poll period, given the number of PIDs available relative to the number of processes a typical Genie node runs.

Once the PID no longer exists, Genie checks the done file for the exit code. It marks the job as succeeded, failed, or killed depending on that code.
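The post-completion check might be sketched like this. The done-file name, location, format, and the "killed" sentinel code are assumptions for illustration, not Genie's actual contract:

```python
# Sketch of the done-file check after the job process exits.
# File name, location, format and sentinel value are assumptions.
import json, os, tempfile

SUCCESS_EXIT_CODE = 0
KILLED_EXIT_CODE = 999  # hypothetical sentinel written by the kill handlers

def final_status(job_dir: str) -> str:
    """Map the exit code recorded in the done file to a terminal job status."""
    done_file = os.path.join(job_dir, "genie", "genie.done")
    with open(done_file) as f:
        exit_code = json.load(f)["exitCode"]
    if exit_code == SUCCESS_EXIT_CODE:
        return "SUCCEEDED"
    if exit_code == KILLED_EXIT_CODE:
        return "KILLED"
    return "FAILED"

# Simulate a completed job directory
job_dir = tempfile.mkdtemp()
os.makedirs(os.path.join(job_dir, "genie"))
with open(os.path.join(job_dir, "genie", "genie.done"), "w") as f:
    json.dump({"exitCode": 0}, f)
print(final_status(job_dir))  # → SUCCEEDED
```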

7.3.2.6. Genie Performs Job Clean-Up

To save disk space, Genie deletes application, cluster, and command dependencies from the job working directory after a job is completed. This can be disabled by an admin. If the job is marked for archival, the working directory is zipped up and stored in the default archive location as {jobId}.tar.gz.

7.3.3. User Behavior

Users can check on the status of their job using the status API and get the output using the output APIs. See the REST Documentation for specifics on how to do that.

7.3.4. Wrap Up

This section should have helped you understand how Genie works at a high level from configuration all the way to user job submission and monitoring. The design of Genie is intended to make this process repeatable and reliable for all users while not hiding any of the details of what is executed at job runtime.

7.4. Netflix Example

Understanding Genie without a concrete example is hard. This section attempts to walk through an end to end configuration and job execution example of a job at Netflix. To see more examples or try your own please see the Demo Guide. Also see the REST API documentation which will describe the purpose of the fields of the resources shown below.

This example contains JSON representations of resources.

7.4.1. Configuration

For brevity we will only cover a subset of the total Netflix configuration.

7.4.1.1. Clusters

At Netflix there are tens of active clusters available at any given time. For this example we’ll focus on the production (SLA) and adhoc Hadoop YARN clusters and the production Presto cluster. For the Hadoop clusters we launch using Amazon EMR but it really shouldn’t matter how clusters are launched provided you can access the proper *site.xml files.

The process of launching a YARN cluster at Netflix involves a set of Python tools which interact with the Amazon and Genie APIs. First these tools use the EMR APIs to launch the cluster based on configuration files for the cluster profile. Then the cluster site XML files are uploaded to S3. Once this is complete all the metadata is sent to Genie to create a cluster resource which you can see examples of below.

Presto clusters are launched using Spinnaker on regular EC2 instances. As part of the pipeline the metadata is registered with Genie using the aforementioned Python tools, which in turn leverage the OSS Genie Python Client.

In the following cluster resources you should note the tags applied to each cluster. Remember that the genie.id and genie.name tags are automatically applied by Genie but all other tags are applied by the admin.

For the YARN clusters note that all the configuration files are referenced by their S3 locations. These files are downloaded into the job working directory at runtime.

7.4.1.1.1. Hadoop Prod Cluster
{
  "id": "bdp_h2prod_20161217_205111",
  "created": "2016-12-17T21:09:30.845Z",
  "updated": "2016-12-20T17:31:32.523Z",
  "tags": [
    "genie.id:bdp_h2prod_20161217_205111",
    "genie.name:h2prod",
    "sched:sla",
    "ver:2.7.2",
    "type:yarn",
    "misc:h2bonus3",
    "misc:h2bonus2",
    "misc:h2bonus1"
  ],
  "version": "2.7.2",
  "user": "dataeng",
  "name": "h2prod",
  "description": null,
  "setupFile": null,
  "configs": [
    "s3://bucket/users/bdp/h2prod/20161217/205111/genie/yarn-site.xml",
    "s3://bucket/users/bdp/h2prod/20161217/205111/genie/mapred-site.xml",
    "s3://bucket/users/bdp/h2prod/20161217/205111/genie/hdfs-site.xml",
    "s3://bucket/users/bdp/h2prod/20161217/205111/genie/core-site.xml"
  ],
  "dependencies": [],
  "status": "UP",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/clusters/bdp_h2prod_20161217_205111"
    },
    "commands": {
      "href": "https://genieHost/api/v3/clusters/bdp_h2prod_20161217_205111/commands"
    }
  }
}
7.4.1.1.2. Hadoop Adhoc Cluster
{
  "id": "bdp_h2query_20161108_204556",
  "created": "2016-11-08T21:07:17.284Z",
  "updated": "2016-12-07T00:51:19.655Z",
  "tags": [
    "sched:adhoc",
    "misc:profiled",
    "ver:2.7.2",
    "sched:sting",
    "type:yarn",
    "genie.name:h2query",
    "genie.id:bdp_h2query_20161108_204556"
  ],
  "version": "2.7.2",
  "user": "dataeng",
  "name": "h2query",
  "description": null,
  "setupFile": "",
  "configs": [
    "s3://bucket/users/bdp/h2query/20161108/204556/genie/core-site.xml",
    "s3://bucket/users/bdp/h2query/20161108/204556/genie/hdfs-site.xml",
    "s3://bucket/users/bdp/h2query/20161108/204556/genie/mapred-site.xml",
    "s3://bucket/users/bdp/h2query/20161108/204556/genie/yarn-site.xml"
  ],
  "dependencies": [],
  "status": "UP",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/clusters/bdp_h2query_20161108_204556"
    },
    "commands": {
      "href": "https://genieHost/api/v3/clusters/bdp_h2query_20161108_204556/commands"
    }
  }
}
7.4.1.1.3. Presto Prod Cluster
{
  "id": "presto-prod-v009",
  "created": "2016-12-05T19:33:52.575Z",
  "updated": "2016-12-05T19:34:14.725Z",
  "tags": [
    "sched:adhoc",
    "genie.id:presto-prod-v009",
    "type:presto",
    "genie.name:presto",
    "ver:0.149",
    "data:prod",
    "type:spinnaker-presto"
  ],
  "version": "1480966454",
  "user": "dataeng",
  "name": "presto",
  "description": null,
  "setupFile": null,
  "configs": [],
  "dependencies": [],
  "status": "UP",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/clusters/presto-prod-v009"
    },
    "commands": {
      "href": "https://genieHost/api/v3/clusters/presto-prod-v009/commands"
    }
  }
}
7.4.1.2. Commands

Commands and applications at Netflix are handled a bit differently than clusters. The source data for these command and application resources is not generated dynamically like the cluster configuration files. Instead, it is stored in a git repository as a combination of YAML, bash, Python, and other files. These configuration files are synced to an S3 bucket every time a commit occurs, ensuring Genie is always pulling in the latest configuration. This sync is performed by a Jenkins job responding to a commit hook trigger. The same Jenkins job also registers the commands and applications with Genie, via the same Python tool set and Genie Python client used for clusters.

Pay attention to the tags applied to the commands as they are used to select which command to use when a job is run. The presto command includes a setup file which allows additional configuration when it is used.

7.4.1.2.1. Presto 0.149
{
  "id": "presto0149",
  "created": "2016-08-08T23:22:15.977Z",
  "updated": "2016-12-20T23:28:44.678Z",
  "tags": [
    "genie.id:presto0149",
    "type:presto",
    "genie.name:presto",
    "ver:0.149",
    "data:prod",
    "data:test"
  ],
  "version": "0.149",
  "user": "builds",
  "name": "presto",
  "description": "Presto Command",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/commands/presto/0.149/setup.sh",
  "configs": [],
  "dependencies": [],
  "status": "ACTIVE",
  "executable": "${PRESTO_CMD} --server ${PRESTO_SERVER} --catalog hive --schema default --debug",
  "checkDelay": 5000,
  "memory": null,
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/commands/presto0149"
    },
    "applications": {
      "href": "https://genieHost/api/v3/commands/presto0149/applications"
    },
    "clusters": {
      "href": "https://genieHost/api/v3/commands/presto0149/clusters"
    }
  }
}

Presto 0.149 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

chmod 755 ${GENIE_APPLICATION_DIR}/presto0149/dependencies/presto-cli
export JAVA_HOME=/apps/bdp-java/java-8-oracle
export PATH=${JAVA_HOME}/bin/:$PATH

export PRESTO_SERVER="http://${GENIE_CLUSTER_NAME}.rest.of.url"
export PRESTO_CMD=${GENIE_APPLICATION_DIR}/presto0149/dependencies/presto-wrapper.py
chmod 755 ${PRESTO_CMD}
7.4.1.2.2. Spark Submit Prod 1.6.1
{
  "id": "prodsparksubmit161",
  "created": "2016-05-17T16:38:31.152Z",
  "updated": "2016-12-20T23:28:33.042Z",
  "tags": [
    "genie.id:prodsparksubmit161",
    "genie.name:prodsparksubmit",
    "ver:1.6",
    "type:sparksubmit",
    "data:prod",
    "ver:1.6.1"
  ],
  "version": "1.6.1",
  "user": "builds",
  "name": "prodsparksubmit",
  "description": "Prod Spark Submit Command",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/commands/spark/1.6.1/prod/scripts/spark-1.6.1-prod-submit-cmd.sh",
  "configs": [
    "s3://bucket/builds/bdp-cluster-configs/genie3/commands/spark/1.6.1/prod/configs/hive-site.xml"
  ],
  "dependencies": [],
  "status": "ACTIVE",
  "executable": "${SPARK_HOME}/bin/dsespark-submit",
  "checkDelay": 5000,
  "memory": null,
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit161"
    },
    "applications": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit161/applications"
    },
    "clusters": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit161/clusters"
    }
  }
}

Spark Submit Prod 1.6.1 Setup File

#!/bin/bash

#set -o errexit -o nounset -o pipefail

export JAVA_HOME=/apps/bdp-java/java-8-oracle

#copy hive-site.xml configuration
cp ${GENIE_COMMAND_DIR}/config/* ${SPARK_CONF_DIR}
cp ${GENIE_COMMAND_DIR}/config/* ${HADOOP_CONF_DIR}/
7.4.1.2.3. Spark Submit Prod 2.0.0
{
  "id": "prodsparksubmit200",
  "created": "2016-10-31T16:59:01.145Z",
  "updated": "2016-12-20T23:28:47.340Z",
  "tags": [
    "ver:2",
    "genie.name:prodsparksubmit",
    "ver:2.0",
    "genie.id:prodsparksubmit200",
    "ver:2.0.0",
    "type:sparksubmit",
    "data:prod"
  ],
  "version": "2.0.0",
  "user": "builds",
  "name": "prodsparksubmit",
  "description": "Prod Spark Submit Command",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/commands/spark/2.0.0/prod/copy-config.sh",
  "configs": [
    "s3://bucket/builds/bdp-cluster-configs/genie3/commands/spark/2.0.0/prod/configs/hive-site.xml"
  ],
  "dependencies": [],
  "status": "ACTIVE",
  "executable": "${SPARK_HOME}/bin/dsespark-submit.py",
  "checkDelay": 5000,
  "memory": null,
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit200"
    },
    "applications": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit200/applications"
    },
    "clusters": {
      "href": "https://genieHost/api/v3/commands/prodsparksubmit200/clusters"
    }
  }
}

Spark Submit 2.0.0 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

# copy hive-site.xml configuration
cp ${GENIE_COMMAND_DIR}/config/* ${SPARK_CONF_DIR}
7.4.1.3. Applications

Below are the applications needed by the above commands. The most important parts of these applications are the dependencies and the setup file.

The dependencies are effectively the installation package and at Netflix are typically a zip of all binaries needed to run a client like Hadoop, Hive, Spark, etc. Some of these zips are generated by builds and placed in S3; others are downloaded from OSS projects and uploaded to S3 periodically. Often minor changes to these dependencies are needed: a new file is uploaded to S3, and the Genie caches on each node refresh with the new file on next access. This pattern allows us to avoid upgrading Genie clusters every time an application changes.

The setup file is effectively the installation script for the aforementioned dependencies. It is sourced by Genie, and the expectation is that after it runs, the application is successfully configured in the job working directory.

7.4.1.3.1. Hadoop 2.7.2
{
  "id": "hadoop272",
  "created": "2016-08-18T16:58:31.044Z",
  "updated": "2016-12-21T00:01:08.263Z",
  "tags": [
    "type:hadoop",
    "genie.id:hadoop272",
    "genie.name:hadoop",
    "ver:2.7.2"
  ],
  "version": "2.7.2",
  "user": "builds",
  "name": "hadoop",
  "description": "Hadoop Application",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/applications/hadoop/2.7.2/setup.sh",
  "configs": [],
  "dependencies": [
    "s3://bucket/hadoop/2.7.2/hadoop-2.7.2.tgz"
  ],
  "status": "ACTIVE",
  "type": "hadoop",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/applications/hadoop272"
    },
    "commands": {
      "href": "https://genieHost/api/v3/applications/hadoop272/commands"
    }
  }
}

Hadoop 2.7.2 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

export JAVA_HOME=/apps/bdp-java/java-7-oracle
export APP_ID=hadoop272
export APP_NAME=hadoop-2.7.2
export HADOOP_DEPENDENCIES_DIR=$GENIE_APPLICATION_DIR/$APP_ID/dependencies
export HADOOP_HOME=$HADOOP_DEPENDENCIES_DIR/$APP_NAME

tar -xf "${HADOOP_DEPENDENCIES_DIR}/hadoop-2.7.2.tgz" -C "${HADOOP_DEPENDENCIES_DIR}"

export HADOOP_CONF_DIR="${HADOOP_HOME}/conf"
export HADOOP_LIBEXEC_DIR="${HADOOP_HOME}/usr/lib/hadoop/libexec"
export HADOOP_HEAPSIZE=1500

cp ${GENIE_CLUSTER_DIR}/config/* $HADOOP_CONF_DIR/

EXTRA_PROPS=$(echo "<property><name>genie.job.id</name><value>$GENIE_JOB_ID</value></property><property><name>genie.job.name</name><value>$GENIE_JOB_NAME</value></property><property><name>lipstick.uuid.prop.name</name><value>genie.job.id</value></property><property><name>dataoven.job.id</name><value>$GENIE_JOB_ID</value></property><property><name>genie.netflix.environment</name><value>${NETFLIX_ENVIRONMENT:-prod}</value></property><property><name>genie.version</name><value>$GENIE_VERSION</value></property><property><name>genie.netflix.stack</name><value>${NETFLIX_STACK:-none}</value></property>" | sed 's/\//\\\//g')

sed -i "/<\/configuration>/ s/.*/${EXTRA_PROPS}&/" $HADOOP_CONF_DIR/core-site.xml

if [ -d "/apps/s3mper/hlib" ]; then
export  HADOOP_OPTS="-javaagent:/apps/s3mper/hlib/aspectjweaver-1.7.3.jar ${HADOOP_OPTS:-}"
fi

# Remove the zip to save space
rm "${HADOOP_DEPENDENCIES_DIR}/hadoop-2.7.2.tgz"
7.4.1.3.2. Presto 0.149
{
  "id": "presto0149",
  "created": "2016-08-08T23:21:58.780Z",
  "updated": "2016-12-21T00:21:10.945Z",
  "tags": [
    "genie.id:presto0149",
    "type:presto",
    "genie.name:presto",
    "ver:0.149"
  ],
  "version": "0.149",
  "user": "builds",
  "name": "presto",
  "description": "Presto Application",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/applications/presto/0.149/setup.sh",
  "configs": [],
  "dependencies": [
    "s3://bucket/presto/clients/0.149/presto-cli",
    "s3://bucket/builds/bdp-cluster-configs/genie3/applications/presto/0.149/presto-wrapper.py"
  ],
  "status": "ACTIVE",
  "type": "presto",
  "_links": {
    "self": {
      "href": "https://genieProd/api/v3/applications/presto0149"
    },
    "commands": {
      "href": "https://genieProd/api/v3/applications/presto0149/commands"
    }
  }
}

Presto 0.149 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

chmod 755 ${GENIE_APPLICATION_DIR}/presto0149/dependencies/presto-cli
chmod 755 ${GENIE_APPLICATION_DIR}/presto0149/dependencies/presto-wrapper.py
export JAVA_HOME=/apps/bdp-java/java-8-oracle
export PATH=${JAVA_HOME}/bin/:$PATH

# Set the cli path for the commands to use when they invoke presto using this Application
export PRESTO_CLI_PATH="${GENIE_APPLICATION_DIR}/presto0149/dependencies/presto-cli"
7.4.1.3.3. Spark 1.6.1
{
  "id": "spark161",
  "created": "2016-05-17T16:32:21.475Z",
  "updated": "2016-12-21T00:01:07.951Z",
  "tags": [
    "genie.id:spark161",
    "type:spark",
    "ver:1.6",
    "ver:1.6.1",
    "genie.name:spark"
  ],
  "version": "1.6.1",
  "user": "builds",
  "name": "spark",
  "description": "Spark Application",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/applications/spark/1.6.1/scripts/spark-1.6.1-app.sh",
  "configs": [
    "s3://bucket/builds/bdp-cluster-configs/genie3/applications/spark/1.6.1/configs/spark-env.sh"
  ],
  "dependencies": [
    "s3://bucket/spark/1.6.1/spark-1.6.1.tgz"
  ],
  "status": "ACTIVE",
  "type": "spark",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/applications/spark161"
    },
    "commands": {
      "href": "https://genieHost/api/v3/applications/spark161/commands"
    }
  }
}

Spark 1.6.1 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

VERSION="1.6.1"
DEPENDENCY_DOWNLOAD_DIR="${GENIE_APPLICATION_DIR}/spark161/dependencies"

# Unzip all the Spark jars
tar -xf ${DEPENDENCY_DOWNLOAD_DIR}/spark-${VERSION}.tgz -C ${DEPENDENCY_DOWNLOAD_DIR}

# Set the required environment variable.
export SPARK_HOME=${DEPENDENCY_DOWNLOAD_DIR}/spark-${VERSION}
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${GENIE_JOB_DIR}
export SPARK_LOG_FILE=spark.log
export SPARK_LOG_FILE_PATH=${GENIE_JOB_DIR}/${SPARK_LOG_FILE}
export CURRENT_JOB_WORKING_DIR=${GENIE_JOB_DIR}
export CURRENT_JOB_TMP_DIR=${CURRENT_JOB_WORKING_DIR}/tmp

export JAVA_HOME=/apps/bdp-java/java-8-oracle
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

# Make Sure Script is on the Path
export PATH=$PATH:${SPARK_HOME}/bin

# Delete the zip to save space
rm ${DEPENDENCY_DOWNLOAD_DIR}/spark-${VERSION}.tgz

Spark 1.6.1 Environment Variable File

#!/bin/bash

#set -o errexit -o nounset -o pipefail
export JAVA_HOME=/apps/bdp-java/java-8-oracle
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"
7.4.1.3.4. Spark 2.0.0
{
  "id": "spark200",
  "created": "2016-10-31T16:58:54.155Z",
  "updated": "2016-12-21T00:01:11.105Z",
  "tags": [
    "type:spark",
    "ver:2.0",
    "ver:2.0.0",
    "genie.id:spark200",
    "genie.name:spark"
  ],
  "version": "2.0.0",
  "user": "builds",
  "name": "spark",
  "description": "Spark Application",
  "setupFile": "s3://bucket/builds/bdp-cluster-configs/genie3/applications/spark/2.0.0/setup.sh",
  "configs": [],
  "dependencies": [
    "s3://bucket/spark-builds/2.0.0/spark-2.0.0.tgz"
  ],
  "status": "ACTIVE",
  "type": "spark",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/applications/spark200"
    },
    "commands": {
      "href": "https://genieHost/api/v3/applications/spark200/commands"
    }
  }
}

Spark 2.0.0 Setup File

#!/bin/bash

set -o errexit -o nounset -o pipefail

start_dir=`pwd`
cd `dirname ${BASH_SOURCE[0]}`
SPARK_BASE=`pwd`
cd $start_dir

export JAVA_HOME=/apps/bdp-java/java-8-oracle
export SPARK_DAEMON_JAVA_OPTS="-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps"

SPARK_DEPS=${SPARK_BASE}/dependencies

export SPARK_VERSION="2.0.0"

tar xzf ${SPARK_DEPS}/spark-${SPARK_VERSION}.tgz -C ${SPARK_DEPS}

# Set the required environment variable.
export SPARK_HOME=${SPARK_DEPS}/spark-${SPARK_VERSION}
export SPARK_CONF_DIR=${SPARK_HOME}/conf
export SPARK_LOG_DIR=${GENIE_JOB_DIR}
export SPARK_LOG_FILE=spark.log
export SPARK_LOG_FILE_PATH=${GENIE_JOB_DIR}/${SPARK_LOG_FILE}
export CURRENT_JOB_WORKING_DIR=${GENIE_JOB_DIR}
export CURRENT_JOB_TMP_DIR=${CURRENT_JOB_WORKING_DIR}/tmp

# Make Sure Script is on the Path
export PATH=$PATH:${SPARK_HOME}/bin

# Delete the tarball to save space
rm ${SPARK_DEPS}/spark-${SPARK_VERSION}.tgz

chmod a+x ${SPARK_HOME}/bin/dsespark-submit.py
7.4.1.4. Relationships

Now that all the resources are available they need to be linked together. Commands need to be added to the clusters they can be run on and applications need to be added as dependencies for commands.

7.4.1.4.1. Commands for a Cluster

When commands are added to a cluster, they should be in priority order. That is, if two commands both match a user's tags for a job, the one higher in the list will be used. This allows us to switch defaults quickly and transparently.

Note: The lists below leave out a lot of commands and fields for brevity. Only the id of the command is included so it can reference the same command resource defined earlier in this article.

Hadoop Prod Cluster

The Hadoop clusters have both currently supported Spark versions added. Spark 1.6.1 is the default but users can override to Spark 2 using the ver tag.

[
  ...
  {
    "id": "prodsparksubmit161"
    ...
  },
  {
    "id": "prodsparksubmit200"
    ...
  }
  ...
]

Hadoop Adhoc Cluster

[
  ...
  {
    "id": "prodsparksubmit161"
    ...
  },
  {
    "id": "prodsparksubmit200"
    ...
  }
  ...
]

Presto Prod Cluster

Presto clusters really only support the Presto command, though it is possible to have multiple backwards-compatible versions of the client available.

[
  ...
  {
    "id": "presto0149"
    ...
  }
  ...
]
7.4.1.4.2. Applications for a Command

Linking applications to a command tells Genie that these applications need to be downloaded and setup in order to successfully run the command. The order of the applications will be the order the download and setup is performed so dependencies between applications should be managed via this order.

Presto 0.149

Presto only needs the corresponding Presto application which contains the Presto Java CLI jar and some setup wrapper scripts.

[
  {
    "id": "presto0149"
    ...
  }
]

Spark Submit Prod 1.6.1

Since we submit Spark jobs to YARN clusters, running the Spark submit commands requires both the Spark and Hadoop applications installed and configured on the job classpath. Hadoop needs to be set up first so that its configurations can be copied to Spark.

[
  {
    "id": "hadoop272"
    ...
  },
  {
    "id": "spark161"
    ...
  }
]

Spark Submit Prod 2.0.0

[
  {
    "id": "hadoop272"
    ...
  },
  {
    "id": "spark200"
    ...
  }
]

7.4.2. Job Submission

Everything is now in place for users to submit their jobs. This section walks through the components and outputs of that process. For clarity, we follow a PySpark job submission to show how Genie figures out the cluster and command to use based on what was configured above.

7.4.2.1. The Request

Below is an actual job request (with a few obfuscations) made by a production job here at Netflix to Genie.

{
  "id": "SP.CS.FCT_TICKET_0054500815", (1)
  "created": "2016-12-21T04:13:07.244Z",
  "updated": "2016-12-21T04:13:07.244Z",
  "tags": [ (2)
    "submitted.by:call_genie",
    "scheduler.job_name:SP.CS.FCT_TICKET",
    "scheduler.run_id:0054500815",
    "SparkPythonJob",
    "scheduler.name:uc4"
  ],
  "version": "NA",
  "user": "someNetflixEmployee",
  "name": "SP.CS.FCT_TICKET",
  "description": "{\"username\": \"root\", \"host\": \"2d35f0d397fd\", \"client\": \"nflx-kragle-djinn/0.4.3\", \"kragle_version\": \"0.41.11\", \"job_class\": \"SparkPythonJob\"}",
  "setupFile": null,
  "commandArgs": "--queue root.sla --py-files dea_pyspark_core-latest.egg fct_ticket.py", (3)
  "clusterCriterias": [ (4)
    {
      "tags": [
        "sched:sla"
      ]
    }
  ],
  "commandCriteria": [ (5)
    "type:sparksubmit",
    "data:prod"
  ],
  "group": null,
  "disableLogArchival": false,
  "email": null,
  "cpu": null,
  "memory": null,
  "timeout": null,
  "configs": [],
  "dependencies": [ (6)
    "s3://bucket/DSE/etl_code/cs/ticket/fct_ticket.py",
    "s3://bucket/dea/pyspark_core/dea_pyspark_core-latest.egg"
  ],
  "applications": [],
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/request"
    },
    "job": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815"
    },
    "execution": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/execution"
    },
    "output": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/output"
    },
    "status": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/status"
    }
  }
}

Let's look at a few of the fields of note:

1 The user set the ID. This is a popular pattern in Netflix for tracking jobs between systems and reattaching to jobs.
2 The user added a few tags that will allow them to search for the job later. This is optional but convenient.
3 The user specifies some arguments to add to the default set of command arguments specified by the command executable field. In this case it’s what python file to run.
4 The user wants this job to run on any cluster that is labeled as having an SLA which also supports the command selected using the commandCriteria
5 The user wants the default Spark Submit command (no version specified) and wants to be able to access production data
6 Here you can see that they add the two files referenced in the commandArgs as dependencies. These files will be downloaded in the root job directory parallel to the run script so they are accessible.
7.4.2.2. The Job

In this case the job was accepted by Genie for processing. Below is the actual job object containing fields the user might care about. Some are copied from the initial request (like tags) and some are added by Genie.

{
  "id": "SP.CS.FCT_TICKET_0054500815",
  "created": "2016-12-21T04:13:07.245Z",
  "updated": "2016-12-21T04:20:35.801Z",
  "tags": [
    "submitted.by:call_genie",
    "scheduler.job_name:SP.CS.FCT_TICKET",
    "scheduler.run_id:0054500815",
    "SparkPythonJob",
    "scheduler.name:uc4"
  ],
  "version": "NA",
  "user": "someNetflixEmployee",
  "name": "SP.CS.FCT_TICKET",
  "description": "{\"username\": \"root\", \"host\": \"2d35f0d397fd\", \"client\": \"nflx-kragle-djinn/0.4.3\", \"kragle_version\": \"0.41.11\", \"job_class\": \"SparkPythonJob\"}",
  "status": "SUCCEEDED", (1)
  "statusMsg": "Job finished successfully.", (2)
  "started": "2016-12-21T04:13:09.025Z", (3)
  "finished": "2016-12-21T04:20:35.794Z", (4)
  "archiveLocation": "s3://bucket/genie/main/logs/SP.CS.FCT_TICKET_0054500815.tar.gz", (5)
  "clusterName": "h2prod", (6)
  "commandName": "prodsparksubmit", (7)
  "runtime": "PT7M26.769S", (8)
  "commandArgs": "--queue root.sla --py-files dea_pyspark_core-latest.egg fct_ticket.py",
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815"
    },
    "output": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/output"
    },
    "request": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/request"
    },
    "execution": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/execution"
    },
    "status": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/status"
    },
    "cluster": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/cluster"
    },
    "command": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/command"
    },
    "applications": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/applications"
    }
  }
}

Some fields of note:

1 The current status of the job. Since this sample was taken after the job was completed, it's already marked SUCCEEDED
2 This job was successful but if it failed for some reason a more human readable reason would be found here
3 The time this job was forked from the Genie process
4 The time Genie recognized the job as complete
5 Where Genie uploaded a zip of the job directory after the job was completed
6 The name of the cluster where this job ran and is de-normalized from the cluster record at the time
7 The name of the command used to run this job which is de-normalized from the command record at the time
8 The total run time in ISO8601
7.4.2.2.1. Cluster Selection

Because the user submitted with sched:sla, the job is limited to clusters with that tag applied. In our example, only the cluster with ID bdp_h2prod_20161217_205111 has this tag. This alone isn't enough to guarantee the job can run (there also needs to be a matching command). If there had been multiple sla clusters, Genie would consider them all equal and randomly select one.

7.4.2.2.2. Command Selection

The command criteria state that this job needs to run on an SLA cluster that supports a command of type prodsparksubmit that can access prod data. Two commands (prodsparksubmit161 and prodsparksubmit200) match these criteria. Both are linked to the cluster bdp_h2prod_20161217_205111. Since both match, Genie selects the "default" one, which is the first one in the list. In this case that was prodsparksubmit161.

7.4.2.3. The Job Execution

Below is the job execution resource. This is mainly for system and admin use, but it can contain useful information for users as well. Mainly it shows which Genie node the job actually ran on, how much memory it was allocated, how frequently the system polled it for status, and when it would have timed out had it kept running.

{
  "id": "SP.CS.FCT_TICKET_0054500815",
  "created": "2016-12-21T04:13:07.245Z",
  "updated": "2016-12-21T04:20:35.801Z",
  "hostName": "a.host.com",
  "processId": 68937,
  "checkDelay": 5000,
  "timeout": "2016-12-28T04:13:09.016Z",
  "exitCode": 0,
  "memory": 1536,
  "_links": {
    "self": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/execution"
    },
    "job": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815"
    },
    "request": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/request"
    },
    "output": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/output"
    },
    "status": {
      "href": "https://genieHost/api/v3/jobs/SP.CS.FCT_TICKET_0054500815/status"
    }
  }
}
7.4.2.4. Job Output

Below is an image of the root of the job output directory (displayed via Genie UI) for the above job. Note that the dependency files are all downloaded there and some standard files are available (run, stdout, stderr).

The URIs in this section point to the UI output endpoint; however, the same information is also available via the REST API, which is what the UI itself calls under the covers. The UI endpoints are shown here because the output is easier to read and because most users will see this version.
Genie Output Directory
7.4.2.4.1. The Run Script

Clicking into the run script shows the contents below. This run script is generated by Genie specifically for each individual job. It contains some standard pieces (error checking, the exit process) as well as job-specific information such as environment variables and the actual command to run. Everything is specific to the job working directory. In particular, note all the GENIE_* environment variable exports. These can be used when building your setup and configuration scripts to make them more flexible.

#!/usr/bin/env bash

set -o nounset -o pipefail

# Set function in case any of the exports or source commands cause an error
trap "handle_failure" ERR EXIT

function handle_failure {
  ERROR_CODE=$?
  # Good exit
  if [[ ${ERROR_CODE} -eq 0 ]]; then
  exit 0
  fi
  # Bad exit
  printf '{"exitCode": "%s"}\n' "${ERROR_CODE}" > ./genie/genie.done
  exit "${ERROR_CODE}"
}

# Set function for handling kill signal from the job kill service
trap "handle_kill_request" SIGTERM

function handle_kill_request {

  KILL_EXIT_CODE=999
  # Disable SIGTERM signal for the script itself
  trap "" SIGTERM

  echo "Kill signal received"

  ### Write the kill exit code to genie.done file as exit code before doing anything else
  echo "Generate done file with exit code ${KILL_EXIT_CODE}"
  printf '{"exitCode": "%s"}\n' "${KILL_EXIT_CODE}" > ./genie/genie.done

  ### Send a kill signal the entire process group
  echo "Sending a kill signal to the process group"
  pkill -g $$

  COUNTER=0
  NUM_CHILD_PROCESSES=`pgrep -g ${SELF_PID} | wc -w`

  # Waiting for 30 seconds for the child processes to die
  while [[  $COUNTER -lt 30 ]] && [[ "$NUM_CHILD_PROCESSES" -gt 3 ]]; do
    echo The counter is $COUNTER
    let COUNTER=COUNTER+1
    echo "Sleeping now for 1 seconds"
    sleep 1
    NUM_CHILD_PROCESSES=`pgrep -g ${SELF_PID} | wc -w`
  done

  # check if any children are still running. If not just exit.
  if [ "$NUM_CHILD_PROCESSES" -eq 3  ]
  then
    echo "Done"
    exit
  fi

  ### Reaching at this point means the children did not die. If so send kill -9 to the entire process group
  # this is a hard kill and will this process itself as well
  echo "Sending a kill -9 to children"

  pkill -9 -g $$
  echo "Done"
}

SELF_PID=$$

echo Start: `date '+%Y-%m-%d %H:%M:%S'`

export GENIE_JOB_DIR="/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815"

export GENIE_APPLICATION_DIR="${GENIE_JOB_DIR}/genie/applications"

export GENIE_COMMAND_DIR="${GENIE_JOB_DIR}/genie/command/prodsparksubmit161"

export GENIE_COMMAND_ID="prodsparksubmit161"

export GENIE_COMMAND_NAME="prodsparksubmit"

export GENIE_CLUSTER_DIR="${GENIE_JOB_DIR}/genie/cluster/bdp_h2prod_20161217_205111"

export GENIE_CLUSTER_ID="bdp_h2prod_20161217_205111"

export GENIE_CLUSTER_NAME="h2prod"

export GENIE_JOB_ID="SP.CS.FCT_TICKET_0054500815"

export GENIE_JOB_NAME="SP.CS.FCT_TICKET"

export GENIE_JOB_MEMORY=1536

export GENIE_VERSION=3

# Sourcing setup file from Application: hadoop272
source ${GENIE_JOB_DIR}/genie/applications/hadoop272/setup.sh

# Sourcing setup file from Application: spark161
source ${GENIE_JOB_DIR}/genie/applications/spark161/spark-1.6.1-app.sh

# Sourcing setup file from Command: prodsparksubmit161
source ${GENIE_JOB_DIR}/genie/command/prodsparksubmit161/spark-1.6.1-prod-submit-cmd.sh

# Dump the environment to a env.log file
env | sort > ${GENIE_JOB_DIR}/genie/logs/env.log

# Kick off the command in background mode and wait for it using its pid
${SPARK_HOME}/bin/dsespark-submit --queue root.sla --py-files dea_pyspark_core-latest.egg fct_ticket.py > stdout 2> stderr &
wait $!

# Write the return code from the command in the done file.
printf '{"exitCode": "%s"}\n' "$?" > ./genie/genie.done
echo End: `date '+%Y-%m-%d %H:%M:%S'`
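The genie.done file written by the script above is a one-line JSON document recording the job’s exit code. As a quick illustration (the format is copied from the printf calls in the run script above; the sed-based extraction is only a sketch, and a real JSON parser would be more robust), it can be read back like this:

```shell
# Recreate the one-line JSON format the run script writes
mkdir -p ./genie
printf '{"exitCode": "%s"}\n' "0" > ./genie/genie.done

# Pull the quoted exitCode value back out (sketch only; prefer a real JSON parser)
EXIT_CODE=$(sed -n 's/.*"exitCode": "\([0-9-]*\)".*/\1/p' ./genie/genie.done)
echo "Job exited with code ${EXIT_CODE}"
```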
7.4.2.4.2. Genie Dir

Inside the output directory there is a genie directory. This directory is where Genie stores all the downloaded dependencies and any logs. Other than the run script, everything outside this directory is intended to be user-generated. Some commands or applications may put their logs in the root directory as well if desired (like Spark or Hive logs).

Genie Directory

Genie system logs go into the logs directory.

Genie Logs Directory

Of particular interest here is the env dump file. It is convenient for debugging jobs: you can see all the environment variables that were available right before Genie executed the final command in the run script.

You can see this file generated in the run script above on this line:

# Dump the environment to a env.log file
env | sort > ${GENIE_JOB_DIR}/genie/logs/env.log

The contents of this file will look something like the below

APP_ID=hadoop272
APP_NAME=hadoop-2.7.2
CURRENT_JOB_TMP_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/tmp
CURRENT_JOB_WORKING_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815
EC2_AVAILABILITY_ZONE=us-east-1d
EC2_REGION=us-east-1
GENIE_APPLICATION_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications
GENIE_CLUSTER_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/cluster/bdp_h2prod_20161217_205111
GENIE_CLUSTER_ID=bdp_h2prod_20161217_205111
GENIE_CLUSTER_NAME=h2prod
GENIE_COMMAND_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/command/prodsparksubmit161
GENIE_COMMAND_ID=prodsparksubmit161
GENIE_COMMAND_NAME=prodsparksubmit
GENIE_JOB_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815
GENIE_JOB_ID=SP.CS.FCT_TICKET_0054500815
GENIE_JOB_MEMORY=1536
GENIE_JOB_NAME=SP.CS.FCT_TICKET
GENIE_VERSION=3
HADOOP_CONF_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/hadoop272/dependencies/hadoop-2.7.2/conf
HADOOP_DEPENDENCIES_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/hadoop272/dependencies
HADOOP_HEAPSIZE=1500
HADOOP_HOME=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/hadoop272/dependencies/hadoop-2.7.2
HADOOP_LIBEXEC_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/hadoop272/dependencies/hadoop-2.7.2/usr/lib/hadoop/libexec
HOME=/home/someNetflixUser
JAVA_HOME=/apps/bdp-java/java-8-oracle
LANG=en_US.UTF-8
LOGNAME=someNetflixUser
MAIL=/var/mail/someNetflixUser
NETFLIX_ENVIRONMENT=prod
NETFLIX_STACK=main
PATH=/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin:/snap/bin:/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/spark161/dependencies/spark-1.6.1/bin
PWD=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815
SHELL=/bin/bash
SHLVL=1
SPARK_CONF_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/spark161/dependencies/spark-1.6.1/conf
SPARK_DAEMON_JAVA_OPTS=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps
SPARK_HOME=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/genie/applications/spark161/dependencies/spark-1.6.1
SPARK_LOG_DIR=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815
SPARK_LOG_FILE_PATH=/mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/spark.log
SPARK_LOG_FILE=spark.log
SUDO_COMMAND=/usr/bin/setsid /mnt/genie/jobs/SP.CS.FCT_TICKET_0054500815/run
SUDO_GID=60243
SUDO_UID=60004
SUDO_USER=genie
TERM=unknown
TZ=GMT
USER=someNetflixUser
USERNAME=someNetflixUser
_=/usr/bin/env

Finally inside the applications folder you can see the applications that were downloaded and configured.

Hadoop App Contents
Hadoop App Contents

7.4.3. Wrap Up

This section went over how Genie is configured by admins at Netflix and how users submit jobs and retrieve their logs and output. Anyone is free to configure Genie however suits their needs, in terms of tags and of which applications are downloaded versus pre-installed on a Genie node, but this method works for us at Netflix.

7.5. Netflix Deployment

This section gives an overview of how Genie is configured and deployed at Netflix.

7.5.1. Genie OSS vs Genie Netflix

Genie leverages Spring Boot to enable customization for our internal deployment, allowing deep integration with the rest of the internal Netflix ecosystem.

Some examples of the additional components we use:

  • Eureka for service discovery and routing

  • Atlas for metrics

  • Archaius for dynamic configuration

  • Titus for job containers

  • And many other Netflix-internal components related to security, monitoring, observability, log indexing, etc.

We augment the open-source Genie components with internal ones by injecting the latter via Spring Boot, without having to change a line of upstream code.

This pattern works well for Genie server, client, and agent CLI, all of which get injected with additional Netflix components for internal use.

7.5.2. Deployment overview

Netflix Deployment
Figure 3. Genie Netflix Deployment
7.5.2.1. EC2 instances and Auto Scaling Group (ASG)

The size of the Genie cluster can be scaled up and down in response to load.

When a new build is deployed, a new ASG is created and the old one disabled. This allows continuous deployment without any user disruption.

The ASG is set to auto-scale when the average amount of memory used by jobs exceeds 60% of the memory available for jobs (which is 80% of the system memory).

7.5.2.2. Elastic Load Balancer

A TCP Elastic Load Balancer (ELB) sits in front of the Genie server instances and is assigned a mnemonic domain name via a Route 53 entry.

All clients use this load balancer as a single entry point: REST API clients talk HTTPS with a random instance in the cluster, and agents talk gRPC. The latter is true both for agents launched outside the cluster as CLI-initiated jobs and for agents launched by the cluster itself in response to a REST API job request.

7.5.2.3. Relational Database (RDS)

We currently use an Amazon Aurora cluster.

Our Genie production cluster has a dedicated RDS, while the various test and development clusters share access to a different one.

The RDS only stores job metadata and entity configurations, which often require strong consistency (e.g., to avoid giving the same unique identifier to two jobs).

7.5.2.4. Zookeeper

We use an Apache Zookeeper cluster for leader election within our Genie ASG. The leader alone performs background cleanup tasks for the entire cluster.

Zookeeper is also used for ephemeral routing information (which agent is currently connected to which server instance).

7.5.2.5. S3

Genie relies on S3 to store all data beside job metadata:

  • Job dependencies (binaries, libraries, configuration files, setup scripts, …​)

  • Job outputs and logs archives

  • Job attachments

7.5.2.6. Titus

Each job submitted via the REST API is launched into its own private Titus container.

The agent running inside the container connects back to the cluster over gRPC via the load balancer.

7.5.2.7. Spinnaker

Genie is deployed using Spinnaker. We currently have a few stacks (prod, test, dev, load, stage, etc) that we use for different purposes. Spinnaker handles all this for us after we configure the pipelines. See the Spinnaker site for more information.

7.5.3. Wrap Up

This section focused on how Genie is deployed within Netflix. Hopefully it brings clarity to one way that Genie can be deployed. Genie is certainly not limited to this deployment model and you are free to come up with your own; this should only serve as a template or example.

8. Features

This section has documentation pertaining to specific features of Genie that may need specific examples.

8.1. Cluster Selection

Genie allows administrators to tag clusters with multiple tags that can reflect whatever domain modeling the admins wish to accomplish. The purpose of the cluster, the types of data it can access, and the expected workload are just some examples of ways tags can be used to classify clusters. Users submit jobs with a series of tags that help Genie identify which cluster to run a job on. Sometimes the tags submitted with the job match multiple clusters. At that point Genie needs a mechanism to choose a final runtime target from the set of matched clusters. This is where the cluster selection feature of Genie comes into play.

Genie has an interface ClusterSelector which can be implemented to provide plugin functionality of algorithms for cluster selection. The interface has a single method:

/**
 * Return best cluster to run job on.
 *
 * @param clusters   An immutable, non-empty list of available clusters to choose from
 * @param jobRequest The job request these clusters are being considered for
 * @return the "best" cluster to run job on or null if no cluster selected
 * @throws GenieException if there is any error
 */
@Nullable
Cluster selectCluster(final List<Cluster> clusters, final JobRequest jobRequest) throws GenieException;

At startup Genie collects all beans that implement the ClusterSelector interface and, based on their order, stores them in an invocation-order list. When multiple clusters are selected from the database based on tags, Genie sends the list of clusters to the implementations in preference order until one of them returns the cluster to use.

Genie currently ships with two implementations of this interface which are described below.

8.1.1. RandomizedClusterSelectorImpl

As the name indicates this selector simply uses a random number generator to select a cluster from the list by index. There is no intelligence to this selection algorithm but it does provide very close to an equal distribution between clusters if the same tags are always used.

This implementation is the "default" implementation. It has the lowest priority order, so if all other active implementations fail to select a cluster this implementation will be invoked and will choose randomly.

8.1.2. ScriptClusterSelector

This implementation, first introduced in 3.1.0, allows administrators to provide a script to be invoked at runtime to decide which cluster to select. Currently JavaScript and Groovy are supported out of the box but others (like Python, Ruby, etc) could be supported by adding their implementations of the ScriptEngine interface to the Genie classpath.

8.1.2.1. Configuration

The script selector is disabled by default. To enable it an admin must set the property genie.scripts.cluster-selector.source to a valid script URI (e.g.: file:///myscript.js, classpath:org/my/package/myscript.groovy, s3://my-app/scripts/myscript.js). It is also recommended to set genie.scripts.cluster-selector.auto-load-enabled to true to enable eager loading of the script (default behavior is to load and compile lazily).

Other properties:

Property: genie.scripts.cluster-selector.source
Description: URI of the script to load. ScriptClusterSelector is enabled only if this property is set.
Default Value: null

Property: genie.scripts.cluster-selector.auto-load-enabled
Description: If true, the script is eagerly loaded during startup, as opposed to lazily loaded on first use.
Default Value: false

Property: genie.scripts.cluster-selector.timeout
Description: Maximum script execution time (in milliseconds). After this time has elapsed, evaluation is shut down.
Default Value: 5000

See also genie.scripts-manager.* properties, which affect this component.
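Putting the properties above together, a minimal configuration that enables the script selector might look like the following application.yml fragment (the script path is a hypothetical example; file://, classpath: and s3:// URIs are all valid per the description above):

```yaml
genie:
  scripts:
    cluster-selector:
      # Hypothetical script location
      source: file:///etc/genie/cluster-selector.groovy
      # Compile the script eagerly at startup rather than lazily on first use
      auto-load-enabled: true
      # Maximum evaluation time in milliseconds
      timeout: 5000
```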

8.1.2.2. Script Contract

The contract between the script and the Java code is as follows:

Table 2. Script Inputs

clusters: Non-empty JSON array of cluster objects (serialized as a string) of the clusters to be evaluated

jobRequest: JSON object (serialized as a string) of the job request that kicked off this evaluation

Table 3. Script Output

A string: The id of the cluster selected by the script algorithm that should be used to run the job

null: No cluster was selected and the evaluation should fall back to another selector algorithm

For most of the script engines the last statement will be the return value.

8.1.2.3. Script Examples

Some simple script examples

8.1.2.3.1. Javascript
var cJson = JSON.parse(clusters);
var jJson = JSON.parse(jobRequest);

var index;
for (index = 0; index < cJson.length; index++) {
    var cluster = cJson[index];
    if (cluster.user === "h") {
        break;
    }
}

index < cJson.length ? cJson[index].id : null;
8.1.2.3.2. Groovy
import groovy.json.JsonSlurper

def jsonSlurper = new JsonSlurper()
def cJson = jsonSlurper.parseText(clusters)
def jJson = jsonSlurper.parseText(jobRequest)

def index = cJson.findIndexOf {
    cluster -> cluster.user == "h"
}

index == -1 ? null : cJson[index].id
8.1.2.4. Caveats

The script selector provides great flexibility for system administrators to test algorithms for cluster load balancing at runtime. Since the script is refreshed periodically, it can even be changed while Genie is running. With this flexibility comes the trade-off that script evaluation is slower than running compiled JVM bytecode directly. The selector tries to offset this by compiling and caching the script code between refresh invocations. It is recommended that once an algorithm is well tested it be converted to a true implementation of the ClusterSelector interface if performance is a concern.

Additionally, if the script contains an error, the ScriptClusterSelector will swallow the exceptions and simply return null from all calls to selectCluster until the script is fixed and the refresh is invoked again. The metric genie.jobs.clusters.selectors.script.select.timer with tag status and value failed can be used to monitor this situation.

Two more metrics are relevant in this context: genie.scripts.load.timer, a timer for script loading (and reloading), which can be used to monitor unavailable resources, compilation errors, etc.; and genie.scripts.evaluate.timer, a timer for script evaluation, which can be used to monitor evaluation errors, timeouts, etc. Both metrics are tagged with scriptUri in case multiple scripts are loaded.

8.1.3. Wrap Up

This section went over the cluster selection feature of Genie. This interface provides an extension point for administrators of Genie to tweak Genie’s runtime behavior to suit their needs.

9. Internals

This section provides an overview of Genie subsystems and components.

9.1. Leader election

Genie uses Zookeeper to elect an instance to be leader. An alternative leader election mechanism can be plugged in instead of Zookeeper.

9.1.1. Leader tasks

The Genie instance elected leader of a given cluster performs different tasks:

  • Trimming old job records and disabling unused resources

  • Marking jobs whose agent is AWOL as failed

  • Collecting and publishing user metrics about running jobs

9.2. Execution stages

Every job is executed by a dedicated agent. This is true for both API-submitted jobs and CLI-initiated jobs.

An agent executing a job goes through the following stages:

  • Initialize - Gather metadata about the agent itself and its environment

  • Handshake - Connect to the Genie cluster and provide agent metadata

    • This allows the cluster to turn away an agent that is too old, or running from an unsupported environment

  • Configure agent - Download additional properties from the server

    • This allows the server to provide configuration, for example set a different network timeout, or change log levels

  • Configure execution - Configure the rest of the execution stages depending on the command-line arguments

    • For example, executing a pre-resolved job submitted via API vs. composing a job request for a CLI job

  • Reserve job - If executing a CLI job, save the job request obtained from command-line arguments

  • Obtain job specification

    • For a CLI job, the agent asks the server to resolve the job request into a specification

    • For an API job, the agent downloads the pre-resolved job specification

  • Create job directory

  • Relocate the agent log - From a temporary location to the job directory, so it’s accessible during execution and archived afterwards

  • Claim job - Atomically claim a job so no other agents will be able to

  • Start heartbeat - The agent needs to constantly tell the Genie cluster that it is still active

  • Register for kill notifications - Allowing a user to kill a job via API

  • Start files service - Allowing a user to download agent logs and outputs during execution

  • Update job status to INIT

  • Create the job run script inside the job directory

  • Download job dependencies, attachments, configurations, setup scripts inside the job directory

  • Launch and monitor the job run script

  • Update job status to RUNNING

  • Monitor the job process until it completes

  • Determine the job final status (SUCCEEDED, FAILED, KILLED)

  • Update the final job status

  • Unregister for kill notifications

  • Archive the job logs and output to long-term storage

  • Stop heart-beating

  • Stop serving files (allowing some time for in-progress transfers to complete)

  • Clean the job directory (removing redundant files such as libraries and binaries, to avoid wasting space on disk)

  • Shutdown

    • The agent exit code reflects the job final status

For a successful job, these steps are executed linearly.

However, when something goes wrong, stages may be skipped. For example, if an agent is rejected by the cluster during the Handshake stage, execution is aborted, and most of the following stages are skipped.

Some stages may fail due to transient issues, in which case they are retried (for example, Handshake may fail due to a broken connection, but a retry may succeed). If a retryable stage runs out of attempts, then execution is aborted, which may skip following stages as appropriate.

9.3. The run script

Every job consists of a bash script that the agent composes, forks, and monitors.

This separation allows treating every job the same way, whether they are Spark or Notebook kernels, or anything else.

The structure of a run script is as follows:

  • bash options

  • Signal traps, to ensure clean shutdown and handling of ctrl-c for CLI jobs

  • Declaration of environment variables

    • These can be used by setup scripts and by the command itself

  • Sourcing of setup script (if the selected entities provide them)

    • For example, the setup script for a Spark application may unpack the Spark libraries, or set additional environment variables

    • The order of setup is: cluster, applications (in order defined by command), command

    • Setup output is piped to a dedicated log file

  • Dump of environment variables

    • Sensitive environment variables can be filtered, or the dumping can be disabled completely

  • Launch the assembled job command

  • Wait for command termination

9.4. Agent heartbeat and routing

Each agent that successfully claimed a job maintains a gRPC connection to a server in the cluster and periodically sends heartbeats.

It does not matter which server instance an agent is connected to. If the gRPC endpoint is fronted by a TCP load-balancer, upon disconnection the agent will reconnect to a different instance.

The server instances use heartbeats to track which agents are connected to each instance. This information is stored in the agent routing table, which Genie server instances use to dispatch a kill request to the instance where the target agent is currently connected. That instance then forwards the request to the agent.

9.5. Agent file streaming

Each agent executing a job opens multiple streaming channels on top of a single connection to a server. These channels can be used to pull a file from the agent and serve it via the REST API.

9.6. Agent logging

Log messages emitted by a Genie agent are routed to agent.log. This file is created at a temporary location when the agent starts and relocated to the job folder once the latter exists.

Some messages (such as fatal errors and high level execution progress updates) are sent to a special logger that displays them in the console. These messages are appended to stderr to avoid polluting the job command output (sent to stdout) such as query outputs.

9.7. Interactive mode

By default, the agent executes in non-interactive mode:

  • Job standard error is appended to a stderr file

  • Job standard output is appended to a stdout file

  • Standard input is ignored

Hence, neither the agent nor the child job process emits anything to the console.

Commands that launch REPL interfaces (e.g., spark-shell) should be launched via the CLI with the --interactive flag. In interactive mode:

  • Job standard error and output go directly to the shell

  • Standard input is received by the job process directly

All jobs launched by the server (as a result of an API job submission) execute in non-interactive mode.

9.8. Persistence

Out of the box, Genie supports MySQL and Postgres persistence via JPA and Hibernate.

H2 in-memory database is also supported for integration tests.

Other kinds of persistence can be plugged in either via JPA or by re-implementing the persistence interface with a different technology.

9.9. Plug-in Selectors

Genie provides a mechanism to load and invoke Groovy scripts, and even reload the script if it changes at runtime.

These hooks are provided in order to plug in custom logic:

9.9.1. Command selector script

A job command criteria may be broad enough to match multiple commands (for example, a Spark job may match 10 different Spark commands, one for each Spark version).

The command selector allows custom plug-in logic to choose the most appropriate.

For example, there may be a 'stable' version which should be picked as default. Or the job may be a good candidate to "canary" the newest and unstable version.

9.9.2. Cluster selector script

Clusters that are eligible to run a given job are filtered down by the command’s cluster criteria combined with the job’s cluster criteria. But the final match set may still contain more than one viable cluster.

In this kind of situation, the selector script can make advanced decisions that go beyond Genie’s scope.

For example, the script could poll the clusters and see which one is the least loaded. Or route the job based on other attributes or metadata.

9.9.3. Agent launcher selector script

Genie supports multiple agent launchers. For example: a launcher that starts the job on bare metal on the same host, a launcher that starts the job inside a Docker container, or a launcher that delegates the job to something like Kubernetes.

As in the previous examples, making these runtime decisions is beyond the scope of Genie itself.

Therefore a hook is provided to plug-in a Groovy selector script that can make this decision based on job metadata and pick one of the available launchers.

9.10. Job State Change Notifications

Genie emits internal application events whenever a job changes state.

Genie includes two built-in components that consume these events and publish notifications to Amazon SNS.

9.11. User limits

Genie can cap the maximum number of jobs that a given user is allowed to run.

The global default can be adjusted, and overrides for individual users can be set higher or lower.

9.12. Job attachments

Genie allows API job requests to submit one or more attachments (by submitting a multi-part POST). These files are placed in the job directory before launch.

Example usages include Presto scripts, Spark application jars, and additional property files.

9.13. Job archival

After a job is completed, successfully or not, job file outputs and logs can be archived for later access.

The set of files that gets archived is configurable, or archival can be disabled completely.

10. Installation

Installing Genie is easy. You can run Genie either as a standalone application with an embedded Tomcat or by deploying the WAR file to an existing Tomcat or other servlet container. There are trade-offs to these two methods as will be discussed below.

10.1. Standalone Jar

The standalone jar is the simplest to deploy as it has no other real moving parts. Just put the jar somewhere on a system and execute java -jar genie-app-4.0.0-SNAPSHOT.jar. The downside is that it’s a little harder to configure or to add jars to the classpath if you want them.

Configuration (application*.yml or application*.properties) files can be loaded from the current working directory or from a .genie/ directory in the user’s home directory (e.g. ~/.genie/application.yml). Classpath items (jars, .jks files, etc) can be added to ~/.genie/lib/ and they will be part of the application classpath.
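As a sketch of that layout (using a scratch directory to stand in for the user’s home directory, so nothing real is touched; the property name is just an illustration):

```shell
# Stand-in for the user's home directory; in practice this is simply ~/
FAKE_HOME="$(mktemp -d)"

# ~/.genie/ holds configuration; ~/.genie/lib/ holds extra classpath items
mkdir -p "${FAKE_HOME}/.genie/lib"

# A configuration file Genie would pick up from ~/.genie/application.yml
cat > "${FAKE_HOME}/.genie/application.yml" <<'EOF'
genie:
  example:
    property: blah
EOF

ls "${FAKE_HOME}/.genie"
```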

Properties can be passed in on the command line two ways:

  1. java -Dgenie.example.property=blah -jar genie-app-4.0.0-SNAPSHOT.jar

  2. java -jar genie-app-4.0.0-SNAPSHOT.jar --genie.example.property=blah

Property resolution goes in this order:

  1. Command line

  2. Classpath profile specific configuration files (e.g. application-prod.yml)

  3. Embedded profile specific configuration files

  4. Classpath default configuration file (e.g. application.yml)

  5. Embedded default configuration file

For more details see the Spring Boot documentation on external configuration.

10.2. Servlet Container Deployment

If you want to deploy to an existing Servlet container deployment you can re-package the genie-web jar inside a WAR. You can see the Spring Boot docs for an example of how to do this.

10.3. Configuration

Genie has a lot of available configuration options. For descriptions of specific properties you can see the Properties section below. Additionally if you want to know how to configure more parts of the application you should have a look at the Spring Boot docs as they will go in depth on how to configure the various Spring components used in Genie.

10.3.1. Profiles

Spring provides a mechanism for segregating parts of application configuration and activating them under certain conditions. This mechanism is known as profiles. By default Genie runs with the dev profile activated. This means that all the properties in application-dev.yml will be appended to, or will overwrite, the defaults in application.yml. Changing the active profiles is easy: just set the property spring.profiles.active to a comma-separated list of profiles. For example --spring.profiles.active=prod,cloud would activate the prod and cloud profiles.

Properties for specific profiles should be stored in files named application-{profileName}.yml. You can make as many as you want but Genie ships with dev, s3 and prod profiles properties already included. Their properties can be seen in the Properties section below.
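As an illustration, a profile-specific override file placed in ~/.genie/ might look like the following (the property shown is real, but its value here is purely illustrative):

```yaml
# ~/.genie/application-prod.yml -- applied only when the prod profile is active
# (activate with --spring.profiles.active=prod)
genie:
  tasks:
    disk-cleanup:
      retention: 7   # illustrative: keep job directories a week in production
```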

10.3.2. Database

By default, since Genie launches with the dev profile active, it will launch with an in-memory database running as part of its process. This means that when you shut Genie down all data will be lost. It is meant for development only. Genie ships with JDBC drivers for MySQL, PostgreSQL and H2. If you want to use a different database you should load its JDBC driver jar file somewhere on the Genie classpath.

For production you should probably enable the prod profile, which creates a connection pool for the database, and then override the properties spring.datasource.url, spring.datasource.username and spring.datasource.password to match your environment. The datasource url needs to be a valid JDBC connection string for your database. You can see examples here or here, or search for your database and JDBC connection string on your search engine of choice.

Genie also ships with database schema scripts for MySQL and PostgreSQL. If you use one of these databases, you will need to load the appropriate schema into your database before you run Genie. Genie no longer creates the schema dynamically, for performance reasons. Follow the sections below to load the schema into your database.

Genie 3.2.0+ software is not compatible with previous database schema. Before upgrading existing Genie servers to 3.2.0, follow the steps below to perform database upgrade, or create a new database with 3.1.x schema. Database upgrades beyond 3.2.0 are handled automatically by the Genie binary via Flyway.
10.3.2.1. MySQL
This assumes the MySQL client binaries are installed
Genie requires MySQL 5.6.3+ due to certain properties not existing before that version

Ensure the following properties are set in your my.cnf:

[mysqld]
innodb_file_per_table=ON
innodb_large_prefix=ON
innodb_file_format=barracuda
Restart MySQL if you’ve changed these properties

Run:

mysql -u {username} -p{password} -h {host} -e 'create database genie;'
10.3.2.1.1. 3.1.x to 3.2.0 database upgrade
Genie requires MySQL 5.6.3+ due to certain properties not existing before that version

If you have an existing Genie installation on a database version < 3.2.0 you’ll need to upgrade your schema to 3.2.0 before continuing.

Ensure the following properties are set in your my.cnf:

[mysqld]
innodb_file_per_table=ON
innodb_large_prefix=ON
innodb_file_format=barracuda
Restart MySQL if you’ve changed these properties

Then run:

mysql -u {username} -p{password} -h {host} genie < upgrade-3.1.x-to-3.2.0.mysql.sql
10.3.2.2. PostgreSQL
This assumes the PSQL binaries are installed

Run:

createdb genie
10.3.2.2.1. 3.1.x to 3.2.0 database upgrade

If you have an existing Genie installation on a database version < 3.2.0 you’ll need to upgrade your schema to 3.2.0 before continuing.

Then run:

psql -U {user} -h {host} -d genie -f upgrade-3.1.x-to-3.2.0.postgresql.sql

10.3.3. Local Directories

Genie requires a few directories to run. By default Genie will place them under /tmp, however in production you should probably create a larger directory in which to store the job working directories and other files. These correspond to the genie.jobs.locations.* properties described below in the Properties section.
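For example, the job directories could be pointed at a dedicated volume like this (the mount path is illustrative; the property names are described in the Properties section):

```yaml
# Point job working directories and archives at a larger volume
genie:
  jobs:
    locations:
      jobs: file:///mnt/genie/jobs/
      archives: file:///mnt/genie/archives/
```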

10.3.3.1. S3

If your commands, applications, or jobs depend on artifacts referenced via S3 URI, you will need to configure the S3 subsystem. If you’re not assuming a role there is nothing you necessarily have to do provided a default credentials provider chain can be created. See here for the rules for that.

If you need to assume a role in order to access Amazon resources from your Genie node, set the property genie.aws.credentials.role to the ARN of the role you’d like to assume. This will force Genie to create a STSAssumeRoleSessionCredentialsProvider instead of the default one.

Example role setting:

genie:
  aws:
    credentials:
      role: <AWS ROLE ARN>

10.4. Wrap Up

This section contains the basic setup instructions for Genie. There are other components that can be added to the system like Redis, Zookeeper and Security systems that are somewhat outside the scope of an initial setup. You can see the Properties section below for the properties you’d need to configure for these systems.

11. Properties

12. Genie Web

This section describes the various properties that can be set to control the behavior of your Genie node and cluster. For more information on Spring properties you should see the Spring Boot reference documentation and the Spring Cloud documentation. The Spring properties described here are ones that we have overridden from Spring defaults.

12.1. Default Properties

12.1.1. Genie Properties

Properties marked 'dynamic' reflect changes to the property value in the environment at runtime, whereas static property values are bound during application startup and do not change once the application is up and running.

Property Description Default Value Dynamic

genie.agent.connection-tracking.cleanup-interval

Interval at which the cleanup task runs

2s

no

genie.agent.connection-tracking.connection-expiration-period

How long after the last heartbeat an agent connection is marked expired

10s

no

genie.agent.configuration.cache-refresh-interval

Interval after which the agent properties cache is refreshed

1m

no

genie.agent.filestream.max-concurrent-transfers

Maximum number of concurrent file transfers that a server allows

100

no

genie.agent.filestream.unclaimed-stream-start-timeout

Interval after which a transfer stream is shut down if it didn’t send the first chunk of data

10s

no

genie.agent.filestream.stalled-transfer-timeout

Interval after which a transfer stream is shut down if it didn’t send any more data

20s

no

genie.agent.filestream.stalled-transfer-check-interval

Interval for checking on stalled downloads

5s

no

genie.agent.filestream.write-retry-delay

Interval between attempts to write data into a stream buffer

300ms

no

genie.agent.filter.enabled

If set to true, enables the built-in agent filter service. The filter behavior is controlled by other active genie.agent.filter.* properties.

no

genie.agent.filter.version.blacklist

A regex matched against agent version (e.g., 1\.2\..*), matching agents are rejected. The filter needs to be enabled for this to take effect.

yes

genie.agent.filter.version.minimum

The minimum version an agent needs to be (e.g., 1.2.3) in order to communicate with this server. The filter needs to be enabled for this to take effect.

yes

genie.agent.filter.version.whitelist

A regex matched against agent version (e.g., 1\.2\..*), agents not matching are rejected. The filter needs to be enabled for this to take effect.

yes
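Putting the filter properties together, a configuration that rejects agents older than a given version might look like this (the version value is illustrative):

```yaml
# Reject agents below a minimum version (value is illustrative)
genie:
  agent:
    filter:
      enabled: true
      version:
        minimum: 4.0.0
```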

genie.agent.heart-beat.send-interval

Interval for sending heartbeats to all connected clients.

5s

no

genie.agent.launcher.local.additional-environment

Environment variables to set when spawning an agent (in addition to the inherited server environment)

no

genie.agent.launcher.local.agent-jar-path

The location of the agent jar. The value is substituted in the command template if the corresponding placeholder is present.

/tmp/genie-agent.jar

no

genie.agent.launcher.local.enabled

Enable or disable the corresponding launcher.

true

no

genie.agent.launcher.local.host-info-expire-after

How long after the job information for this host is written into the local cache before it is evicted. See Spring Docs for Duration conversion details.

1m

no

genie.agent.launcher.local.host-info-refresh-after

How long after the job information for this host is written before it is automatically refreshed from the underlying data source. See Spring Docs for Duration conversion details.

30s

no

genie.agent.launcher.local.launch-command-template

The system command that the launcher should use to launch an agent process. Ordered list of arguments. Contains placeholders that will be replaced at runtime.

java -jar <AGENT_JAR_PLACEHOLDER> exec --server-host 127.0.0.1 --server-port <SERVER_PORT_PLACEHOLDER> --api-job --job-id <JOB_ID_PLACEHOLDER>

no

genie.agent.launcher.local.max-job-memory

The maximum amount of memory, in megabytes, that a job can be allocated while using the local launcher

10240

no

genie.agent.launcher.local.max-total-job-memory

The total number of MB out of the system memory that Genie can use for running agents

30720

no

genie.agent.launcher.local.process-output-capture-enabled

Whether to capture stdout and stderr from the forked agent subprocess to a file for debugging purposes

false

no

genie.agent.launcher.local.run-as-user-enabled

Whether to launch the agent subprocess as the user specified in the job request

false

no

genie.agent.launcher.local.server-hostname

The hostname to connect to. This value is used if the command-line template references the server hostname placeholder

127.0.0.1

no

genie.agent.launcher.titus.application-name

The name of the application that launches the Titus job

genie

no

genie.agent.launcher.titus.additional-bandwidth

An additional amount of overhead bandwidth that should be added to whatever the job originally requested

0B

yes

genie.agent.launcher.titus.additional-cpu

An additional number of CPUs that should be added to whatever the job originally requested

1

yes

genie.agent.launcher.titus.additional-disk-size

An additional amount of disk space that should be added to whatever the job originally requested

1GB

yes

genie.agent.launcher.titus.additional-environment

Map of additional environment variables to set for the job

empty

yes

genie.agent.launcher.titus.additional-gpu

An additional number of GPUs that should be added to whatever the job originally requested

0

yes

genie.agent.launcher.titus.additional-job-attributes

Map of attributes to send with the Titus request in addition to whatever defaults there are

empty

yes

genie.agent.launcher.titus.additional-memory

An additional amount of memory that should be added to whatever the job originally requested

2GB

yes

genie.agent.launcher.titus.capacity-group

The name of the capacity group to target

default

yes

genie.agent.launcher.titus.command-template

The container command array, placeholder values are substituted at runtime

[exec, --api-job, --launchInJobDirectory, --job-id, <JOB_ID>, --server-host, <SERVER_HOST>, --server-port, <SERVER_PORT>]

no

genie.agent.launcher.titus.container-attributes

Map attributes to send to Titus specific to the container

empty

yes

genie.agent.launcher.titus.detail

The detail (jobGroupInfo) within the application space for Titus request

empty string

no

genie.agent.launcher.titus.enabled

Whether the Titus agent launcher is active or not

false

no

genie.agent.launcher.titus.endpoint

The Titus API endpoint URI

https://example-titus-endpoint.tld:1234

no

genie.agent.launcher.titus.entry-point-template

The container entry point array, placeholder values are substituted at runtime

[/bin/genie-agent]

no

genie.agent.launcher.titus.genie-server-host

The hostname of the Genie server or cluster for the agent to connect to

example.genie.tld

no

genie.agent.launcher.titus.genie-server-port

The port number of the Genie server or cluster for the agent to connect to

9090

no

genie.agent.launcher.titus.healthIndicator-max-size

Maximum number of recently launched jobs displayed in the health indicator details

100

no

genie.agent.launcher.titus.health-indicator-expiration

Maximum time a job is displayed in the health indicator details after launch

30m

no

genie.agent.launcher.titus.i-am-role

The IAM role to launch the job as

arn:aws:iam::000000000:role/SomeProfile

no

genie.agent.launcher.titus.image-name

The name of the container image

image-name

yes

genie.agent.launcher.titus.image-tag

The image tag

latest

yes

genie.agent.launcher.titus.minimum-bandwidth

The minimum network bandwidth to allocate to the container

7MB

yes

genie.agent.launcher.titus.minimum-cpu

The minimum number of CPUs any container should launch with

1

yes

genie.agent.launcher.titus.minimum-disk-size

The minimum size of the disk volume to attach to the job container

10GB

yes

genie.agent.launcher.titus.minimum-gpu

The minimum number of GPUs any container should launch with

0

yes

genie.agent.launcher.titus.minimum-memory

The minimum amount of memory a container should be allocated

2GB

yes

genie.agent.launcher.titus.owner-email

The team email of the owners of the Titus job

owners@genie.tld

no

genie.agent.launcher.titus.retries

How many times to retry launch if the job fails

0

yes

genie.agent.launcher.titus.runtime-limit

The maximum runtime of the job

12H

no

genie.agent.launcher.titus.security-attributes

A map of security attributes

empty map

no

genie.agent.launcher.titus.security-groups

A list of security groups for the job

empty list

no

genie.agent.launcher.titus.sequence

The sequence (jobGroupInfo) within the application space for Titus request

empty string

no

genie.agent.launcher.titus.stack

The stack (jobGroupInfo) within the application space for Titus request

empty string

no

genie.agent.routing.refresh-interval

Interval at which individual connections are refreshed

3s

no

genie.agent.configuration.dynamic.*

Properties with this prefix are forwarded to each agent during startup (with the prefix stripped)

yes

genie.aws.credentials.role

The AWS role ARN to assume when connecting to S3. If this is set Genie will create a credentials provider that will attempt to assume this role on the host Genie is running on

no

genie.aws.s3.buckets.[bucketName].roleARN

For the bucket with name bucketName the ARN of the role to assume to read/write to that bucket. If a bucket is used that isn’t listed in this map the default credentials configured will be used

no

genie.aws.s3.buckets.[bucketName].region

The AWS region the bucket with bucketName is in. Clients to talk to this bucket will be created in this region. If no value is specified the bucket is assumed to be in the same region as the Genie process.

no

genie.grpc.server.services.job-file-sync.ackIntervalMilliseconds

How many milliseconds to wait between checks whether some acknowledgement should be sent to the agent regardless of whether the maxSyncMessages threshold has been reached or not

30000

no

genie.grpc.server.services.job-file-sync.maxSyncMessages

How many messages to receive from the agent before an acknowledgement message is sent back from the server

10

no

genie.health.maxCpuLoadConsecutiveOccurrences

Defines the threshold of consecutive occurrences of CPU load crossing the <maxCpuLoadPercent>. Health of the system is marked unhealthy if the CPU load of a system goes beyond the threshold 'maxCpuLoadPercent' for 'maxCpuLoadConsecutiveOccurrences' consecutive times.

3

no

genie.health.maxCpuLoadPercent

Defines the threshold for the maximum CPU load percentage to consider for an instance to be unhealthy. Health of the system is marked unhealthy if the CPU load of a system goes beyond this threshold for 'maxCpuLoadConsecutiveOccurrences' consecutive times.

80

no

genie.http.connect.timeout

The number of milliseconds before HTTP calls between Genie nodes should time out on connection

2000

no

genie.http.read.timeout

The number of milliseconds before HTTP calls between Genie nodes should time out on attempting to read data

10000

no

genie.jobs.active-limit.count

The maximum number of active jobs a user is allowed to have. Once a user hits this limit, newly submitted jobs are rejected. This property is ignored unless genie.jobs.active-limit.enabled is set to true. This limit applies to users that don’t have an override set via genie.jobs.active-limit.overrides.<user-name>.

100

no

genie.jobs.active-limit.enabled

Enables the per-user active job limit. The limit is controlled by the genie.jobs.active-limit.count property.

false

no

genie.jobs.active-limit.overrides.<user-name>

The maximum number of active jobs that user 'user-name' is allowed to have. This property is ignored unless genie.jobs.active-limit.enabled is set to true.

-

yes
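Combining the three properties above, a per-user limit with a higher cap for one specific account could be configured like this (the user name and counts are illustrative):

```yaml
# Cap most users at 50 active jobs; give one account a higher limit
genie:
  jobs:
    active-limit:
      enabled: true
      count: 50
      overrides:
        some-service-account: 500   # illustrative user name
```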

genie.jobs.attachments.location-prefix

Common prefix where attachments are stored

s3://genie/attachments

no

genie.jobs.attachments.max-size

Maximum size of an attachment

100MB

no

genie.jobs.attachments.max-total-size

Maximum size of all attachments combined (Spring and Tomcat may also independently limit the size of upload)

150MB

no

genie.jobs.files.filter.case-sensitive-matching

Whether the regular expressions defined in genie.jobs.files.filter.* are case-sensitive or not.

true

no

genie.jobs.files.filter.directory-traversal-reject-patterns

List of regex patterns, if a directory matches any, then its contents are not included in the job files manifest

[]

no

genie.jobs.files.filter.directory-reject-patterns

List of regex patterns, if a directory matches any, then it is not included in the job files manifest

[]

no

genie.jobs.files.filter.file-reject-patterns

List of regex patterns, if a file matches any, then it is not included in the job files manifest

[]

no

genie.jobs.forwarding.enabled

Whether to attempt to forward kill and get output requests for jobs

true

no

genie.jobs.forwarding.port

The port to forward requests to, as it could be different from the ELB port

8080

no

genie.jobs.forwarding.scheme

The connection protocol to use (http or https)

http

no

genie.jobs.locations.archives

The default root location where job archives should be stored. Scheme should be included. Created if it doesn’t exist.

file://${java.io.tmpdir}genie/archives/

no

genie.jobs.locations.jobs

The default root location where job working directories will be placed. Created by the system if it doesn’t exist.

file://${java.io.tmpdir}genie/jobs/

no

genie.jobs.memory.maxSystemMemory

The total number of MB out of the system memory that Genie can use for running jobs

30720

no

genie.jobs.memory.defaultJobMemory

The total number of megabytes Genie will assume a job is allocated if not overridden by a command or user at runtime

1024

no

genie.jobs.memory.maxJobMemory

The maximum amount of memory, in megabytes, that a job client can be allocated

10240

no

genie.jobs.submission.enabled

Whether new job submission is enabled (true) or disabled (false)

true

yes

genie.jobs.submission.disabledMessage

A message to return to the users when new job submission is disabled

Job submission is currently disabled. Please try again later.

yes
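These two dynamic properties make it possible to drain a cluster without restarting it; for example (the message text is illustrative):

```yaml
# Temporarily reject new job submissions, e.g. ahead of maintenance
genie:
  jobs:
    submission:
      enabled: false
      disabledMessage: "Genie is undergoing maintenance. Please try again later."
```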

genie.jobs.users.runAsUserEnabled

Whether Genie should run the jobs as the user who submitted the job or not. Genie user must have sudo rights for this to work.

false

no

genie.leader.enabled

Whether this node should be the leader of the cluster or not. Should only be used if leadership is not being determined by Zookeeper or other mechanism via Spring

false

no

genie.notifications.sns.enabled

Whether to enable SNS publishing of events

-

no

genie.notifications.sns.topicARN

The SNS topic to publish to

-

no

genie.notifications.sns.additionalEventKeys.<KEY>

Map of KEYs and corresponding values to be added to the SNS messages published

-

no

genie.redis.enabled

Whether to enable storage of HTTP sessions inside Redis via Spring Session

false

no

genie.retry.archived-job-get-metadata.initialDelay

The initial interval between retries to get archived job metadata. Milliseconds

1000

no

genie.retry.archived-job-get-metadata.multiplier

The amount the delay should increase on every retry. e.g. start at 1 second → 2 seconds → 4 seconds with a value of 2.0

2.0

no

genie.retry.archived-job-get-metadata.noOfRetries

The number of times to retry requests to get archived job metadata before failure

5

no
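To make the back-off concrete: with the defaults above (1000 ms initial delay, 2.0 multiplier, 5 retries), successive delays double each attempt. The helper below is only an illustrative sketch of that schedule; the actual retry behavior is implemented server-side (via Spring Retry).

```python
def retry_delays(initial_ms: float, multiplier: float, retries: int) -> list[float]:
    """Delays (ms) before each retry under a simple exponential back-off policy."""
    return [initial_ms * multiplier ** i for i in range(retries)]

# Defaults: 1000 ms initial delay, x2.0 per attempt, 5 retries
print(retry_delays(1000, 2.0, 5))  # [1000.0, 2000.0, 4000.0, 8000.0, 16000.0]
```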

genie.retry.initialInterval

The amount of time to wait after initial failure before retrying the first time in milliseconds

10000

no

genie.retry.maxInterval

The maximum amount of time to wait between retries for the final retry in the back-off policy

60000

no

genie.retry.noOfRetries

The number of times to retry requests before failure

5

no

genie.retry.s3.noOfRetries

The number of times to retry requests to S3 before failure

5

no

genie.retry.sns.noOfRetries

The number of times to retry requests to SNS before failure

5

no

genie.scripts-manager.refresh-interval

Interval for the script manager to reload and recompile known scripts (in milliseconds)

300000

no

genie.scripts.agent-launcher-selector.properties-refresh-interval

Interval for refreshing property values passed to the script.

5m

no

genie.scripts.agent-launcher-selector.source

URI of the script to load. AgentLauncherSelectorManagedScript is enabled only if this property is set.

null

no

genie.scripts.agent-launcher-selector.auto-load-enabled

If true, the script is eagerly loaded during startup, as opposed to lazily loaded on first use.

false

no

genie.scripts.agent-launcher-selector.timeout

Maximum script execution time (in milliseconds). After this time has elapsed, evaluation is shut down.

5000

no

genie.scripts.cluster-selector.properties-refresh-interval

Interval for refreshing property values passed to the script.

5m

no

genie.scripts.cluster-selector.source

URI of the script to load. ScriptClusterSelector is enabled only if this property is set.

null

no

genie.scripts.cluster-selector.auto-load-enabled

If true, the script is eagerly loaded during startup, as opposed to lazily loaded on first use.

false

no

genie.scripts.cluster-selector.timeout

Maximum script execution time (in milliseconds). After this time has elapsed, evaluation is shut down.

5000

no

genie.scripts.command-selector.properties-refresh-interval

Interval for refreshing property values passed to the script.

5m

no

genie.scripts.command-selector.source

URI of the script to load. ScriptCommandSelector is enabled only if this property is set.

null

no

genie.scripts.command-selector.auto-load-enabled

If true, the script is eagerly loaded during startup, as opposed to lazily loaded on first use.

false

no

genie.scripts.command-selector.timeout

Maximum script execution time (in milliseconds). After this time has elapsed, evaluation is shut down.

5000

no

genie.tasks.agent-cleanup.enabled

Whether to enable the task that detects jobs whose agent has gone AWOL, and marks them failed

true

no

genie.tasks.agent-cleanup.launchTimeLimit

How long a job can stay in ACCEPTED state, waiting for the agent to claim it, before the job is marked failed, in milliseconds

240000

no

genie.tasks.agent-cleanup.refreshInterval

How often the AWOL agent task is executed, in milliseconds

10000

no

genie.tasks.agent-cleanup.reconnectTimeLimit

How long of a leeway to give a job after its agent disconnected and before the job is marked failed, in milliseconds

120000

no

genie.tasks.archive-status-cleanup.enabled

Whether to enable the task that detects jobs whose archive status was left in PENDING state

true

no

genie.tasks.archive-status-cleanup.check-interval

How often the archive status task is executed

10s

no

genie.tasks.archive-status-cleanup.gracePeriod

How much time to allow after the final status update before marking a job’s archive status UNKNOWN

2m

no

genie.tasks.database-cleanup.batch-size

The max number of entities to delete per transaction (applies to files, clusters, commands, tags, applications)

1000

yes

genie.tasks.database-cleanup.application-cleanup.skip

Skip the Applications table when performing database cleanup

false

yes

genie.tasks.database-cleanup.cluster-cleanup.skip

Skip the Clusters table when performing database cleanup

false

yes

genie.tasks.database-cleanup.command-cleanup.skip

Skip the Commands table when performing database cleanup

false

yes

genie.tasks.database-cleanup.command-deactivation.commandCreationThreshold

The number of days before the current cleanup run that a command must have been created in the system to be considered for deactivation.

false

yes

genie.tasks.database-cleanup.command-deactivation.jobCreationThreshold

The number of days before the current cleanup run that a command must not have been used in a job for that command to be considered for deactivation.

false

yes

genie.tasks.database-cleanup.command-deactivation.skip

Skip deactivating Commands when performing database cleanup

false

yes

genie.tasks.database-cleanup.enabled

Whether or not to delete old and unused records from the database at a scheduled interval. See: genie.tasks.database-cleanup.expression

true

no

genie.tasks.database-cleanup.expression

The cron expression for how often to run the database cleanup task

0 0 0 * * *

yes

genie.tasks.database-cleanup.file-cleanup.skip

Skip the Files table when performing database cleanup

false

yes

genie.tasks.database-cleanup.job-cleanup.skip

Skip the Jobs table when performing database cleanup

false

yes

genie.tasks.database-cleanup.job-cleanup.pageSize

The max number of jobs to delete per transaction

1000

yes

genie.tasks.database-cleanup.job-cleanup.retention

The number of days to retain jobs in the database

90

yes

genie.tasks.database-cleanup.tag-cleanup.skip

Skip the Tags table when performing database cleanup

false

yes

genie.tasks.disk-cleanup.enabled

Whether or not to remove old job directories on the Genie node

true

no

genie.tasks.disk-cleanup.expression

How often to run the disk cleanup task as a cron expression

0 0 0 * * *

no

genie.tasks.disk-cleanup.retention

The number of days to leave old job directories on disk

3

no

genie.tasks.executor.pool.size

The number of executor threads available for tasks to be run on within the node in an adhoc manner. Best to set to the number of CPU cores x 2 + 1

1

no

genie.tasks.scheduler.pool.size

The number of available threads for the scheduler to use to run tasks on the node at scheduled intervals. Best to set to the number of CPU cores x 2 + 1

1

no

genie.tasks.user-metrics.enabled

Whether or not to publish user-tagged metrics

true

no

genie.tasks.user-metrics.refresh-interval

Publish/refresh interval in milliseconds

30000

no

genie.zookeeper.discovery-path

The namespace to use for Genie discovery service (maps agents to the node they’re connected to)

/genie/discovery/

no

genie.zookeeper.leader-path

The namespace to use for Genie leadership election of a given cluster

/genie/leader/

no

12.1.2. Spring Properties

Property Description Default Value

info.genie.version

The Genie version to be displayed by the UI and returned by the actuator /info endpoint. Set by the build.

Current build version

management.endpoints.web.base-path

The base path for the Spring Boot Actuator management endpoints. Switched from the default /actuator

/admin

spring.application.name

The name of the application in the Spring context

genie

spring.banner.location

Banner file location

genie-banner.txt

spring.data.redis.repositories.enabled

Whether Spring data repositories should attempt to be created for Redis

false

spring.datasource.url

JDBC URL of the database

jdbc:h2:mem:genie

spring.datasource.username

Username for the datasource

root

spring.datasource.password

Database password

spring.datasource.hikari.leak-detection-threshold

How long to wait (in milliseconds) before a connection should be considered leaked out of the pool if it hasn’t been returned

30000

spring.datasource.hikari.pool-name

The name of the connection pool. Will show up in logs under this name.

genie-hikari-db-pool

spring.flyway.baselineDescription

Description for the initial baseline of a database instance

Base Version

spring.flyway.baselineOnMigrate

Whether or not to baseline when Flyway is present and the datasource targets a DB that isn’t managed by Flyway

true

spring.flyway.baselineVersion

The initial DB version (the schema version at which Genie migrated to Flyway; shouldn’t be changed)

3.2.0

spring.flyway.locations

Where flyway should look for database migration files

classpath:db/migration/{vendor}

spring.jackson.serialization.write-dates-as-timestamps

Whether to serialize instants as timestamps or ISO8601 strings

false

spring.jackson.time-zone

Time zone used when formatting dates. For instance America/Los_Angeles

UTC

spring.jpa.hibernate.ddl-auto

DDL mode. This is actually a shortcut for the "hibernate.hbm2ddl.auto" property.

validate

spring.jpa.hibernate.properties.hibernate.jdbc.time_zone

The timezone to use when writing dates to the database see article

UTC

spring.profiles.active

The default active profiles when Genie is run

dev

spring.mail.host

The hostname of the mail server

spring.mail.testConnection

Whether to check the connection to the mail server on startup

false

spring.redis.host

Endpoint for the Redis cluster used to store HTTP session information

spring.servlet.multipart.max-file-size

Max attachment file size. Values can use the suffix "MB" or "KB" to indicate a megabyte or kilobyte size.

100MB

spring.servlet.multipart.max-request-size

Max job request size. Values can use the suffix "MB" or "KB" to indicate a megabyte or kilobyte size.

200MB

spring.session.store-type

The back end storage system for Spring to store HTTP session information. See Spring Boot Session for more information. Currently on classpath only none, redis and jdbc will work.

none

12.1.3. Spring Cloud Properties

Properties set by default to manipulate various Spring Cloud libraries.

Property Description Default Value

cloud.aws.credentials.useDefaultAwsCredentialsChain

Whether to attempt creation of a standard AWS credentials chain. See Spring Cloud AWS for more information.

true

cloud.aws.region.auto

Whether the AWS region will be attempted to be auto recognized via the AWS metadata services on EC2. See Spring Cloud AWS for more information.

false

cloud.aws.region.static

The default AWS region. See Spring Cloud AWS for more information.

us-east-1

cloud.aws.stack.auto

Whether auto stack detection is enabled. See Spring Cloud AWS for more information.

false

spring.cloud.zookeeper.enabled

Whether to enable zookeeper functionality or not

false

spring.cloud.zookeeper.connectString

The connection string for the zookeeper cluster

localhost:2181
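Enabling Zookeeper-backed leader election and agent routing therefore comes down to two overrides; for example (the connect string hosts are placeholders):

```yaml
# Enable Zookeeper for leader election and request routing
spring:
  cloud:
    zookeeper:
      enabled: true
      connectString: zk1.example.com:2181,zk2.example.com:2181
```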

12.1.4. gRPC Server properties

Property

Description

Default Value

grpc.server.port

The port on which to bind the gRPC server, if enabled.

9090

grpc.server.address

The address on which to bind the gRPC server, if enabled.

0.0.0.0

12.2. Profile Specific Properties

12.2.1. Prod Profile

Property Description Default Value

spring.datasource.url

JDBC URL of the database

jdbc:mysql://127.0.0.1/genie?useUnicode=yes&characterEncoding=UTF-8&useLegacyDatetimeCode=false

spring.datasource.username

Username for the datasource

root

spring.datasource.password

Database password

spring.datasource.hikari.data-source-properties.cachePrepStmts

MySQL Tuning

true

spring.datasource.hikari.data-source-properties.prepStmtCacheSize

MySQL Tuning

250

spring.datasource.hikari.data-source-properties.prepStmtCacheSqlLimit

MySQL Tuning

2048

spring.datasource.hikari.data-source-properties.serverTimezone

MySQL Tuning

UTC

spring.datasource.hikari.data-source-properties.userServerPrepStatements

MySQL Tuning

true

13. Genie Agent

This section describes the various properties that can be set to control the behavior of the Genie agent.

Unless otherwise noted, properties are loaded from the standard sources (defaults, profiles, other files). The server also has a chance to override them during the 'Agent Configuration' execution stage.

13.1. Properties forwarding

Server-side properties that match a given prefix are forwarded to all agents during the agent configuration stage. For a property to be forwarded, it needs to carry the prefix genie.agent.configuration.dynamic. (the prefix is stripped before forwarding).

Example: Set property genie.agent.configuration.dynamic.genie.agent.runtime.emergency-shutdown-delay on the server to have it forwarded to all agents as genie.agent.runtime.emergency-shutdown-delay.
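The forwarding rule above can be sketched as follows. This is illustrative Python, not Genie source code; it only demonstrates the prefix-matching and prefix-stripping behavior described in the text.

```python
# Illustrative sketch only (not Genie source code): server properties
# carrying the dynamic prefix are forwarded to agents with that prefix
# stripped; everything else is ignored.
PREFIX = "genie.agent.configuration.dynamic."

def forwarded_properties(server_properties: dict) -> dict:
    """Return the properties that would be forwarded to agents,
    with the dynamic prefix stripped from each key."""
    return {
        key[len(PREFIX):]: value
        for key, value in server_properties.items()
        if key.startswith(PREFIX)
    }

server_props = {
    PREFIX + "genie.agent.runtime.emergency-shutdown-delay": "10m",
    "genie.some.other.property": "not forwarded",
}
print(forwarded_properties(server_props))
# {'genie.agent.runtime.emergency-shutdown-delay': '10m'}
```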

Property Description Default Value Notes

genie.agent.runtime.emergency-shutdown-delay

Time allowed to the agent to shut down cleanly (archive, cleanup, …) before the JVM is forcefully shut down

5m

genie.agent.runtime.force-manifest-refresh-timeout

Maximum time to block when trying to forcefully push a manifest update

5s

genie.agent.runtime.file-stream-service.error-back-off.delay-type

Scheduling policy for backoff in case of error during file streaming

FROM_PREVIOUS_EXECUTION_BEGIN

genie.agent.runtime.file-stream-service.error-back-off.min-delay

Minimum delay before another attempt during file streaming

1s

genie.agent.runtime.file-stream-service.error-back-off.max-delay

Maximum delay before another attempt during file streaming

10s

genie.agent.runtime.file-stream-service.error-back-off.factor

Multiplication factor for retry delay before another attempt during file streaming

1.1

genie.agent.runtime.file-stream-service.enable-compression

Whether to enable compression when transmitting file chunks to the server

true

genie.agent.runtime.file-stream-service.data-chunk-max-size

Max size of a file chunk sent to the server

1MB

genie.agent.runtime.file-stream-service.max-concurrent-streams

Maximum number of files transmitted concurrently to the server

5

genie.agent.runtime.file-stream-service.drain-timeout

Maximum time a file transfer is allowed to run to completion during agent shutdown before it is terminated

15s

Should be lower than genie.agent.runtime.emergency-shutdown-delay

genie.agent.runtime.heart-beat-service.interval

Interval between heartbeats

2s

genie.agent.runtime.heart-beat-service.error-retry-delay

Interval to wait before re-establishing the heartbeat stream

1s

genie.agent.runtime.job-kill-service.response-check-back-off.delay-type

Scheduling policy for backoff in case of error during kill request

FROM_PREVIOUS_EXECUTION_COMPLETION

genie.agent.runtime.job-kill-service.response-check-back-off.min-delay

Minimum delay before another attempt during kill request

500ms

genie.agent.runtime.job-kill-service.response-check-back-off.max-delay

Maximum delay before another attempt during kill request

5s

genie.agent.runtime.job-kill-service.response-check-back-off.factor

Multiplication factor for retry delay before another attempt during kill request

1.2

genie.agent.runtime.job-monitor-service.check-remote-job-status

Whether to periodically poll the running job status from the server, and to shut down in case the job is marked failed

true

genie.agent.runtime.job-monitor-service.check-interval

How often to check for file limits

1m

genie.agent.runtime.job-monitor-service.max-files

Maximum number of files in the job directory

64000

genie.agent.runtime.job-monitor-service.max-file-size

Maximum size of the largest file in the job directory

8GB

genie.agent.runtime.job-monitor-service.max-total-size

Maximum total size of the job directory

16GB

genie.agent.runtime.job-setup-service.environment-dump-filter-expression

Grep regular expression (ERE syntax) filter applied to environment variables dumped into env.log before job execution

.*

genie.agent.runtime.job-setup-service.environment-dump-filter-inverted

Whether to invert environment-dump-filter-expression so that environment entries NOT matching the expression are logged

false

genie.agent.runtime.shutdown.execution-completion-leeway

Time allowed to the job execution state machine to shut down cleanly before the JVM is shut down

60s

genie.agent.runtime.shutdown.internal-executors-leeway

Time allowed to tasks running on internal task executors to complete before the agent terminates

30s

This property is bound during initialization and cannot be modified at runtime by the server.

genie.agent.runtime.shutdown.internal-schedulers-leeway

Time allowed to tasks running on internal task schedulers to complete before the agent terminates

30s

This property is bound during initialization and cannot be modified at runtime by the server.

genie.agent.runtime.shutdown.system-executor-leeway

Time allowed to tasks running on Spring’s system task executor to complete before the agent terminates

60s

This property is bound during initialization and cannot be modified at runtime by the server.

genie.agent.runtime.shutdown.system-scheduler-leeway

Time allowed to tasks running on Spring’s system task scheduler to complete before the agent terminates

60s

This property is bound during initialization and cannot be modified at runtime by the server.

14. Selector Script Properties

Properties with a special prefix are forwarded to the script selector runtime (with the prefix stripped).

For example, server property cluster-selector.canary is passed to the script runtime as canary.

Selector Script Prefix

Cluster Selector Script

cluster-selector.

Command Selector Script

command-selector.

Agent Launcher Selector Script

agent-launcher-selector.
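For instance, a server-side property intended for the command selector script (the property name canary is illustrative):

```properties
# Set on the server:
command-selector.canary=true
# Visible to the command selector script runtime as: canary=true
```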

15. Metrics

The following is an extensive list of metrics (counters, timers, gauges, …) published by Genie itself, in addition to the metrics published by Spring and the standard JVM and system metrics and statistics.

Metrics are collected using Micrometer, which allows system admins to plug in a variety of backend collection systems (Atlas, Datadog, Graphite, Ganglia, etc.). See the Micrometer website for more details. Genie ships with no backend system compiled in; one must be added if desired, otherwise metrics are only published within the local JVM and available on the Actuator /metrics endpoint.
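To publish to a backend such as Atlas, the matching Micrometer registry has to be added to the build; a hedged Gradle sketch (coordinates shown without a version, assuming dependency management supplies one):

```groovy
dependencies {
    // Micrometer registry for the Atlas backend; Spring Boot
    // auto-configures a registry it finds on the classpath.
    runtimeOnly("io.micrometer:micrometer-registry-atlas")
}
```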

Name Description Unit Source Tags

genie.agents.connections.connected.gauge

Number of agents connected to the node

count

AgentRoutingServiceCuratorDiscoveryImpl

-

genie.agents.connections.lookup.timer

Timing and count of lookup in Curator’s discovery service

nanoseconds

AgentRoutingServiceCuratorDiscoveryImpl

status, exceptionClass, found

genie.agents.connections.registered.gauge

Number of agents connected to the node and registered in discovery

count

AgentRoutingServiceCuratorDiscoveryImpl

-

genie.agents.connections.zookeeperSessionState.counter

Count of Zookeeper session state changes

count

AgentRoutingServiceCuratorDiscoveryImpl

connectionState

genie.agents.connections.registered.timer

Timing and count of registrations of local agent with discovery service

nanoseconds

AgentRoutingServiceCuratorDiscoveryImpl

status, exceptionClass

genie.agents.connections.unregistered.timer

Timing and count of unregistrations of local agent with discovery service

nanoseconds

AgentRoutingServiceCuratorDiscoveryImpl

status, exceptionClass

genie.agents.connections.connected.counter

Count of new agent connections to the local node

count

AgentRoutingServiceCuratorDiscoveryImpl

-

genie.agents.connections.disconnected.counter

Count of new agent disconnections from the local node

count

AgentRoutingServiceCuratorDiscoveryImpl

-

genie.agents.heartbeating.gauge

The number of agents sending heartbeats to the server

count

GRpcHeartBeatServiceImpl

-

genie.agents.fileTransfers.requested.counter

Count of file transfers from remote agents to this node

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.rejected.counter

Count of attempted file transfers that were rejected because too many transfers are already in progress on this node

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.manifestCache.size

Size of the manifest cache

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.controlStreams.size

Number of active agent control streams

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.timeout.counter

Count of transfers that timed out on this node

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.transferSize.summary

Number of bytes requested from the agent for a given transfer

distribution (bytes)

GRpcAgentFileStreamServiceImpl

-

genie.agents.fileTransfers.activeTransfers.size

Number of active transfers on this node

count

GRpcAgentFileStreamServiceImpl

-

genie.agents.launchers.launch.timer

Timer recording how long it takes to execute the launch API for that specific implementation

nanoseconds

LocalAgentLauncherImpl, TitusAgentLauncherImpl

status, exceptionClass, launcherClass

genie.api.v3.jobs.submitJobWithoutAttachments.rate

Counts the number of jobs submitted without an attachment

count

JobRestController

-

genie.api.v3.jobs.submitJobWithAttachments.rate

Counts the number of jobs submitted with one or more attachments

count

JobRestController

-

genie.files.serve.timer

Time taken to serve a file

nanoseconds

JobDirectoryServerServiceImpl

status, exceptionClass, archiveStatus

genie.health.indicator.timer

Time taken for each health indicator to report its status

nanoseconds

HealthCheckMetricsAspect

healthIndicatorClass, healthIndicatorStatus

genie.jobs.active.gauge

Number of jobs currently active locally

amount

LocalAgentLauncherImpl

launcherClass

genie.jobs.agentDisconnected.gauge

Current number of agent jobs whose agent is not connected to any node.

count

AgentJobCleanupTask

-

genie.jobs.agentDisconnected.terminated.counter

Counter of jobs terminated because the agent disappeared for too long

count

AgentJobCleanupTask

status, exceptionClass

genie.jobs.agentLauncher.selectors.script.select.timer

Time taken to select the agent launcher for a job

nanoseconds

AgentJobCleanupTask

status, exceptionClass, agentLauncherClass

genie.jobs.archiveStatus.cleanup.counter

Counter of jobs whose archive status was left in PENDING state after execution completed

count

ArchiveStatusCleanupTask

status, exceptionClass

genie.jobs.attachments.s3.count.distribution

Distribution summary of the number of files attached

count

S3AttachmentServiceImpl

genie.jobs.attachments.s3.largestSize.distribution

Distribution summary of the size of the largest file attached

bytes

S3AttachmentServiceImpl

genie.jobs.attachments.s3.totalSize.distribution

Distribution summary of the total size of the files attached

bytes

S3AttachmentServiceImpl

genie.jobs.attachments.s3.upload.timer

Time taken to upload job attachments to S3 (only measured for jobs with attachments)

nanoseconds

S3AttachmentServiceImpl

status, exceptionClass

genie.jobs.clusters.selectors.script.select.timer

Time taken by the loaded script to select a cluster among the ones passed as input

nanoseconds

ScriptClusterSelector

status, exceptionClass, clusterName, clusterId

genie.jobs.memory.used.gauge

Total amount of memory allocated to local jobs (according to job request)

Megabytes

LocalJobLauncherImpl

launcherClass

genie.jobs.notifications.final-state.counter

Count the number of completed job notifications

count

JobNotificationMetricPublisher

jobFinalState

genie.jobs.notifications.state-transition.counter

Count the number of job state transition notifications

count

JobNotificationMetricPublisher

fromState, toState

genie.jobs.submit.rejected.jobs-limit.counter

Count of jobs rejected by the server because the user is exceeding the maximum number of running jobs

count

JobRestController

user, jobsUserLimit

genie.launchers.titus.launch.timer

Time taken to request that a job be launched on Titus

nanoseconds

TitusAgentLauncher

status, exceptionClass

genie.notifications.sns.publish.counter

Count the number of notifications published to SNS

count

AbstractSNSPublisher

status, type

genie.rpc.job.handshake.timer

Time taken to serve an agent request to handshake

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, agentVersion

genie.rpc.job.configure.timer

Time taken to serve an agent request to obtain runtime configuration

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, agentVersion

genie.rpc.job.reserve.timer

Time taken to serve an agent request to reserve a job

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, agentVersion

genie.rpc.job.resolve.timer

Time taken to serve an agent request to resolve a job request into a job specification

nanoseconds

GRpcJobServiceImpl

status, exceptionClass

genie.rpc.job.getSpecification.timer

Time taken to serve an agent request to obtain a job specification

nanoseconds

GRpcJobServiceImpl

status, exceptionClass

genie.rpc.job.dryRunResolve.timer

Time taken to serve an agent request to resolve a job request into a job specification (dry run mode)

nanoseconds

GRpcJobServiceImpl

status, exceptionClass

genie.rpc.job.claim.timer

Time taken to serve an agent request to claim a job for execution

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, agentVersion

genie.rpc.job.changeStatus.timer

Time taken to serve an agent request to update a job status

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, statusFrom, statusTo

genie.rpc.job.getStatus.timer

Time taken to serve an agent request to obtain a job’s status

nanoseconds

GRpcJobServiceImpl

status, exceptionClass

genie.rpc.job.changeArchiveStatus.timer

Time taken to serve an agent request to update a job archive status

nanoseconds

GRpcJobServiceImpl

status, exceptionClass, statusTo

genie.scripts.load.timer

Time taken to load (download, read, compile) a given script

nanoseconds

ScriptManager

status, exceptionClass, scriptUri

genie.scripts.evaluate.timer

Time taken to evaluate a given script (if previously compiled successfully)

nanoseconds

ScriptManager

status, exceptionClass, scriptUri

genie.services.agentConfiguration.reloadProperties.timer

Time taken to retrieve the set of properties forwarded to bootstrapping agents

nanoseconds

AgentConfigurationServiceImpl

status, exceptionClass, numProperties

genie.services.agentJob.handshake.counter

Counter for calls to the 'handshake' protocol of the Genie Agent Job Service

count

AgentJobServiceImpl

status, exceptionClass, agentVersion, agentHost, handshakeDecision

genie.services.jobLaunch.launchJob.timer

Time taken to launch a job (includes record creation and update, job resolution and agent launch)

nanoseconds

JobLaunchServiceImpl

status, exceptionClass, agentLauncherSelectedClass

genie.services.jobLaunch.selectLauncher.timer

Time taken to invoke a selector to choose which agent launcher to use

nanoseconds

JobLaunchServiceImpl

status, exceptionClass, numAvailableLaunchers, agentLauncherSelectorClass, agentLauncherSelectedClass

genie.services.jobResolver.generateClusterCriteriaPermutations.timer

Time taken to generate all the permutations for cluster criteria between the command options and the job request

nanoseconds

JobResolverServiceImpl

genie.services.jobResolver.resolve.timer

Time taken to completely resolve the job

nanoseconds

JobResolverServiceImpl

status, exceptionClass, saved

genie.services.jobResolver.resolveApplications.timer

Time taken to retrieve applications information for this task

nanoseconds

JobResolverServiceImpl

status, exceptionClass

genie.services.jobResolver.resolveCluster.clusterSelector.counter

Counter for cluster selector algorithms invocations

count

JobResolverServiceImpl

class, status, clusterName, clusterId, clusterSelectorClass

genie.services.jobResolver.resolveCluster.timer

Time taken to resolve the cluster to use for a job

nanoseconds

JobResolverServiceImpl

status, clusterName, clusterId, exceptionClass

genie.services.jobResolver.resolveCommand.timer

Time taken to resolve the command to use for a job

nanoseconds

JobResolverServiceImpl

status, commandName, commandId, exceptionClass

genie.web.services.archivedJobService.getArchivedJobMetadata.timer

The time taken to fetch the metadata of an archived job if it isn’t already cached

nanoseconds

ArchivedJobServiceImpl

status, exceptionClass

genie.tasks.archiveStatusCleanup.timer

Time taken to execute the cleanup task

nanoseconds

ArchiveStatusCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.applicationDeletion.timer

Time taken to delete application records from the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.clusterDeletion.timer

Time taken to delete cluster records from the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.commandDeactivation.timer

Time taken to deactivate command records in the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.commandDeletion.timer

Time taken to delete command records from the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.fileDeletion.timer

Time taken to delete file records from the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.tagDeletion.timer

Time taken to delete tag records from the database

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.duration.timer

Time taken to cleanup database records for jobs that executed over a given amount of time in the past

nanoseconds

DatabaseCleanupTask

status, exceptionClass

genie.tasks.databaseCleanup.numDeletedApplications.gauge

Number of deleted application records purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeactivatedCommands.gauge

Number of command records set to INACTIVE during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeletedClusters.gauge

Number of terminated cluster records purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeletedCommands.gauge

Number of deleted command records purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeletedFiles.gauge

Number of unused file references purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeletedJobs.gauge

Number of job records purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.databaseCleanup.numDeletedTags.gauge

Number of unused tag records purged during the last database cleanup pass

amount

DatabaseCleanupTask

-

genie.tasks.diskCleanup.numberDeletedJobDirs.gauge

Number of job folders deleted during the last cleanup pass

amount

DiskCleanupTask

-

genie.tasks.diskCleanup.numberDirsUnableToDelete.gauge

Number of failures deleting job folders during the last cleanup pass

amount

DiskCleanupTask

-

genie.tasks.diskCleanup.unableToDeleteJobsDir.rate

Counts the number of times a local job folder could not be deleted

count

DiskCleanupTask

-

genie.tasks.diskCleanup.unableToGetJobs.rate

Counts the number of times a local job folder is encountered during cleanup and the corresponding job record in the database cannot be found

count

DiskCleanupTask

-

genie.user.active-jobs.gauge

Number of active jobs, tagged with the owning user.

count

UserMetricsTask

-

genie.user.active-memory.gauge

Amount of memory used by active jobs, tagged with the owning user.

Megabytes

UserMetricsTask

-

genie.user.active-users.gauge

Number of distinct users with at least one job in RUNNING state.

count

UserMetricsTask

-

genie.web.controllers.exception

Counts exceptions returned to the user

count

GenieExceptionMapper

(*) Source may add additional tags on a case-by-case basis