These instructions walk through setting up Genie from scratch. If you just want to evaluate Genie quickly, it’s recommended you check out the Genie Docker image. Reviewing the Dockerfile for the release will also help clarify the instructions contained here.
Assumptions
These instructions are for the current Genie release 2.2.3
Installation is happening on a Linux-based system
Prerequisites
The following items should be installed and configured in order to successfully set up Genie 2.2.3.
You don’t need a standalone database. If you launch Genie with all default settings it will use an in-memory database. This data won’t persist beyond the JVM process shutting down, but it is fine for development.
Genie ships with MySQL 5.6 JDBC connector jars and Spring configurations in the WAR but you can put in your own if you want to use another database.
Binary distributions of the clients like Hive/Pig/Presto etc., including the command-line scripts that launch these jobs respectively.
You can download these packages from the project pages themselves
If you don’t want to run against the in-memory database, aren’t using MySQL, or your MySQL isn’t running on localhost, you’ll need to modify the database configuration. Genie uses Spring for various functionality, including the data access layer. Database connection information is stored in the genie-jpa.xml file.
You’ll find the file in $CATALINA_HOME/webapps/ROOT/WEB-INF/classes
Edit your configurations as needed. If you’re not using MySQL you’ll have to change the driver class. The connection URL will have to be changed if your database isn’t on localhost. The password is currently set to nothing, so if you have one configured you should set it. Note that if you want to use MySQL you’ll need to change the Spring active profile at runtime, which is described below.
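As an illustrative sketch only (the bean id, property names, and profile layout in your genie-jpa.xml may differ, so match what is already in the file), a MySQL data source definition could look something like this:

```xml
<!-- Hypothetical sketch; align bean names with those already in genie-jpa.xml -->
<beans profile="prod">
    <bean id="dataSource" class="org.apache.commons.dbcp2.BasicDataSource" destroy-method="close">
        <!-- Swap the driver class if you are not using MySQL -->
        <property name="driverClassName" value="com.mysql.jdbc.Driver"/>
        <!-- Change the host if your database is not on localhost -->
        <property name="url" value="jdbc:mysql://localhost:3306/genie"/>
        <property name="username" value="genie"/>
        <!-- Set this if your database has a password configured -->
        <property name="password" value=""/>
    </bean>
</beans>
```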
Update Swagger Endpoint (Optional)
Genie ships with Swagger integration. By default the Swagger JSON is served at http://localhost:7001/genie/swagger.json. If you want your install of Genie to support the Swagger UI, located at http://{genieHost}:{port}/docs/api, and to bind the Swagger docs and JSON to anything other than localhost, you’ll need to modify two things.
On line 19 of $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/genie-swagger.xml modify localhost:7001 to match your hostname and port.
Then in $CATALINA_HOME/webapps/ROOT/WEB-INF/lib you’ll find genie-server-2.2.3.jar. Copy this jar somewhere like /tmp and unzip it with jar xf genie-server-2.2.3.jar. Once the jar is unzipped you’ll find the documentation webpage at META-INF/resources/docs/api/index.html. Modify all occurrences of localhost and 7001 to match your deployment, then zip the files back up into a jar.
The whole process described above should look something like this:
```shell
GENIE_SERVER_JAR_PATH=($CATALINA_HOME/webapps/ROOT/WEB-INF/lib/genie-server-*.jar)
GENIE_SERVER_JAR_NAME=$(basename ${GENIE_SERVER_JAR_PATH})
mkdir /tmp/genie-server
mv ${GENIE_SERVER_JAR_PATH} /tmp/genie-server/
pushd /tmp/genie-server/
jar xf ${GENIE_SERVER_JAR_NAME}
rm ${GENIE_SERVER_JAR_NAME}
sed -i "s/localhost/${EC2_PUBLIC_HOSTNAME}/g" META-INF/resources/docs/api/index.html
jar cf ${GENIE_SERVER_JAR_NAME} *
mv ${GENIE_SERVER_JAR_NAME} ${GENIE_SERVER_JAR_PATH}
popd
rm -rf /tmp/genie-server
sed -i "s/localhost/${EC2_PUBLIC_HOSTNAME}/g" $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/genie-swagger.xml
```
Once you’ve made these changes, when you bring up Genie you can navigate to http://{genieHost}:{port}/docs/api to begin using the Swagger UI.
Download Genie Scripts
Genie leverages several scripts to launch and kill client processes when jobs are submitted. You should create a directory on your system (we’ll refer to this as GENIE_HOME) to store these scripts and you’ll need to reference this location in the property file configuration in the next section.
Download all the Genie scripts that are used to run jobs
On line 228 of joblauncher.sh you may have to modify the hadoop conf location. Older Hadoop distros have $HADOOP_HOME/conf/ and newer ones seem to store their conf files in $HADOOP_HOME/etc/hadoop/.
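The kind of change involved can be sketched like this (a hypothetical snippet, not the actual contents of joblauncher.sh; the install location is an example):

```shell
# Hypothetical sketch: choose the Hadoop conf dir based on the distro layout
HADOOP_HOME=${HADOOP_HOME:-/opt/hadoop}           # example install location; adjust for your system
if [ -d "${HADOOP_HOME}/etc/hadoop" ]; then
    HADOOP_CONF_DIR="${HADOOP_HOME}/etc/hadoop"   # newer Hadoop distros
else
    HADOOP_CONF_DIR="${HADOOP_HOME}/conf"         # older Hadoop distros
fi
echo "Using Hadoop conf dir: ${HADOOP_CONF_DIR}"
```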
Modify Genie Properties
Genie properties will be located in $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/genie.properties.
Environment specific properties will be in $CATALINA_HOME/webapps/ROOT/WEB-INF/classes/genie-{env}.properties. These properties will be loaded by Archaius using the archaius.deployment.environment property in CATALINA_OPTS.
You should review all the properties in the file, but pay particular attention to the following and set them as needed for your configuration.
For further information on customizing your Genie install see the customization section below.
Run Tomcat
Set CATALINA_OPTS to tell Archaius which properties to use as well as which Spring profile to use. By default Genie uses dev for the Spring profile. genie-dev.properties will override properties in genie.properties if -Darchaius.deployment.environment=dev is used. The below example sets the Spring profile to prod, which will use the prod database connection to MySQL (unless this was modified above).
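A prod launch might set CATALINA_OPTS along these lines (a sketch; -Dspring.profiles.active is the standard Spring flag, but verify the full set of options your deployment needs):

```shell
# Hypothetical sketch: point Archaius at the prod properties and activate the prod Spring profile
export CATALINA_OPTS="-Darchaius.deployment.environment=prod -Dspring.profiles.active=prod"
echo "${CATALINA_OPTS}"
# Then start Tomcat as usual, e.g.: $CATALINA_HOME/bin/startup.sh
```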
You can view the various properties, jars, JMX metrics, etc from the admin console.
Additional Configuration
Database Configuration
Genie uses Spring for database connection setup and by default uses the Derby database, which is not recommended for production use. At Netflix, we use MySQL on RDS with DBCP2 for connection pooling.
You can look at the prod Spring profile in genie-jpa.xml for an example on how to set up MySQL/DBCP2.
Job Managers
Genie uses a set of classes implementing the Job Manager interface to encapsulate the logic for running jobs on a particular cluster type. This usually includes setting up environment variables and other environment preparation before running a job, as well as anything specific needed to kill a job.
The YarnJobManagerImpl and PrestoJobManagerImpl are used to run jobs on clusters of types Yarn and Presto respectively. If you want to provide your own you can change these or implement new ones for different cluster types. This mapping is controlled by the following properties:
If you implement your own you’ll want to assign it a type. For example, let’s use Spark. You would add a property com.netflix.genie.server.job.manager.spark.impl and set it to your implementation class. Then when you configure your cluster for Spark you set the clusterType field to spark. At runtime, when this cluster is chosen, Genie will look for the new property above and use Spring to create a new instance to run your job.
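Following the naming pattern above, the mappings would look something like this (the YARN/Presto class names and the Spark entry are illustrative; check your genie.properties for the exact values):

```properties
// Existing mappings (class names illustrative)
com.netflix.genie.server.job.manager.yarn.impl=com.netflix.genie.server.jobmanager.impl.YarnJobManagerImpl
com.netflix.genie.server.job.manager.presto.impl=com.netflix.genie.server.jobmanager.impl.PrestoJobManagerImpl
// Hypothetical new mapping for a custom Spark job manager
com.netflix.genie.server.job.manager.spark.impl=com.example.genie.SparkJobManagerImpl
```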
Copy and Mkdir Commands
Genie copies various files during job execution. Some of these files include cluster configurations, application clients, scripts, etc. To do this the various Job Managers set a copy command for the job launcher to use during execution. These commands by default in the properties file are set to use the Hadoop fs command. If the Hadoop binaries aren’t installed on your Genie node or you’d just prefer to plug in your own functionality you are free to do so. Your copy command just needs to take in the standard src and dst arguments. The mkdir command needs to take in the path of the directory to create.
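A custom command only has to honor those argument conventions. As a hypothetical sketch (the S3 branch just echoes a message; a real script would invoke your own tooling, such as the AWS CLI):

```shell
# Hypothetical pluggable copy command: takes the standard src and dst arguments
genie_copy() {
    src="$1"; dst="$2"
    case "$src" in
        s3://*|s3n://*) echo "S3 copy: $src -> $dst" ;;  # plug in your own S3 tooling here
        *) cp -r "$src" "$dst" ;;                        # plain local copy otherwise
    esac
}

# Hypothetical mkdir command: takes the path of the directory to create
genie_mkdir() {
    mkdir -p "$1"
}

# Local demonstration
genie_mkdir /tmp/genie-copy-demo
echo hello > /tmp/genie-copy-demo/src.txt
genie_copy /tmp/genie-copy-demo/src.txt /tmp/genie-copy-demo/dst.txt
cat /tmp/genie-copy-demo/dst.txt   # prints "hello"
```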
Internally within Netflix we actually have a custom script which interacts directly with AWS and S3 rather than using the Hadoop commands.
If you develop your own Job Manager you’ll want to create copy and mkdir command properties for it as well.
Running Jobs as Another User
Genie runs Hadoop jobs as the user and group specified by the HADOOP_USER_NAME and HADOOP_GROUP_NAME environment variables. If you are running Genie on an instance that doesn’t have those users/groups, job launches are likely to fail. If you are comfortable with it, you can have Genie add users that don’t already exist by un-commenting the following lines in joblauncher.sh (note that the user running Genie must be a sudoer for this to work):
```shell
# Uncomment the following if you want Genie to create users if they don't exist already
echo "Create user.group $HADOOP_USER_NAME.$HADOOP_GROUP_NAME, if it doesn't exist already"
sudo groupadd $HADOOP_GROUP_NAME
sudo useradd $HADOOP_USER_NAME -g $HADOOP_GROUP_NAME
```
Eureka Integration
Configure Genie Server
If you have multiple Genie instances that you want to use to load-balance your jobs, you can use Eureka as your discovery service. If you only have a single instance, you can safely skip this information.
In genie.properties, set com.netflix.karyon.eureka.disable=false to enable the Eureka integration, and then set up eureka-client.properties.
Before starting Tomcat, also append the following to CATALINA_OPTS (assuming you are running in the cloud)
For Genie clients, you need to add a genieClient.properties (see [genie2Client.properties](https://github.com/Netflix/genie/blob/2.2.3/genie-common/src/main/resources/genie2Client.properties) for an example) to your CLASSPATH, with the following settings (NOTE: this assumes you’ve named your app genie; if you’ve named it genie2, change the file and property names to match):
```properties
// Servers virtual address
genieClient.ribbon.DeploymentContextBasedVipAddresses=genie.cloud.netflix.net:<your_tomcat_port>
// Use Eureka/Discovery enabled load balancer
genieClient.ribbon.NIWSServerListClassName=com.netflix.niws.loadbalancer.DiscoveryEnabledNIWSServerList
// Service URLs for the Eureka server
eureka.serviceUrl.default=http://<EUREKA_SERVER_HOST>:<EUREKA_SERVER_PORT>/eureka/v2/
```
More Eureka Information
For more details on how to set up Eureka, please consult the Eureka Wiki.
Setting up Job Forwarding Between Nodes
Genie can automatically load-balance jobs between nodes when Eureka integration is enabled. If you have Eureka integration enabled, review/update the following properties in genie.properties:
```properties
// max running jobs on this instance, after which 503s are thrown
com.netflix.genie.server.max.running.jobs=30
// number of running jobs on instance, after which to start forwarding to other instances
// to disable auto-forwarding of jobs, set this to higher than com.netflix.genie.server.max.running.jobs
com.netflix.genie.server.forward.jobs.threshold=20
// find an idle instance with fewer running jobs than current, by this delta
// e.g. if com.netflix.genie.server.forward.jobs.threshold=20, forward jobs to an instance
// with number of running jobs < (20-com.netflix.genie.server.idle.host.threshold.delta)
com.netflix.genie.server.idle.host.threshold.delta=5
// max running jobs on instance that jobs can be forwarded to
com.netflix.genie.server.max.idle.host.threshold=27
```
Cloud Security
AWS Keys
Assuming that your data is in S3, you need to set up AWS access keys so that the Hadoop, Hive, and Pig jobs can do S3 listings, reads, and writes.
If you choose to put your keys in the *-site.xml files on S3 (if you are using YARN), you may want to look into S3 Server Side Encryption to encrypt data at rest. You should also get and put the config files to/from your cloud instances securely using https/ssl. EMR’s S3 file system already uses https by default (via the fs.s3n.ssl.enabled property, which is set to true by default).
Alternatively, you may simply add the fs.s3n.awsAccessKeyId and fs.s3n.awsSecretAccessKey properties to your core-default.xml for the Hadoop installation on the Genie server.
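That addition would look like the following fragment (the property names are taken from above; the values are placeholders to replace with your real keys):

```xml
<property>
    <name>fs.s3n.awsAccessKeyId</name>
    <value>YOUR_ACCESS_KEY_ID</value>
</property>
<property>
    <name>fs.s3n.awsSecretAccessKey</name>
    <value>YOUR_SECRET_ACCESS_KEY</value>
</property>
```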
Security Groups
The Genie server needs to have access to the various daemons of your running clusters (eg: Resource Manager for YARN clusters). Please consult the Amazon EC2 Security Group docs to enable such access.