ReAir is a collection of easy-to-use tools for replicating tables and partitions between Hive data warehouses. These tools are targeted at developers who already have some familiarity with operating warehouses based on Hadoop and Hive. The source code is available at https://github.com/airbnb/reair.git.
The replication features in ReAir are useful for the following use cases:
While many organizations start out with a single Hive warehouse, they often want better isolation between production and ad hoc workloads. Two isolated Hive warehouses accommodate this need well, and with two warehouses, there is a need to replicate evolving datasets. ReAir can be used to replicate data from one warehouse to another and propagate updates incrementally as they occur.
Lastly, ReAir can be used to replicate datasets to a hot-standby warehouse for fast failover in disaster recovery scenarios.
To accommodate these use cases, ReAir includes both batch and incremental replication tools. Batch replication executes a one-time copy of a list of tables. Incremental replication is a long-running process that copies objects as they are created or changed on the source warehouse.
First, build the JAR for batch replication:

```
cd reair
./gradlew shadowjar -p main -x test
```
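If the build succeeds, the shaded JAR should appear under main/build/libs; a quick check (the path matches the run command used below):

```
# Confirm the batch replication JAR was produced by the build.
ls -lh main/build/libs/airbnb-reair-main-1.0.0-all.jar
```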
Next, create a text file listing the tables that you want to copy, one per line, in the form db_name.table_name:

```
my_db1.my_table1
my_db2.my_table2
```
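For a large warehouse, the list can be generated rather than written by hand. A minimal sketch, assuming the hive CLI is available on the source cluster and my_db1 is one of its databases:

```
# List every table in my_db1 and prefix each name with the database so the
# output matches the db_name.table_name format expected by --table-list.
hive -e "SHOW TABLES IN my_db1;" | sed 's/^/my_db1./' > my_tables_to_copy.txt
```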
Launch the batch replication job using the hadoop jar command on the destination cluster, specifying the config file and the list of tables to copy. A larger heap for the client may be needed for large batches, so set HADOOP_HEAPSIZE appropriately. Also, depending on how the warehouse is set up, you may need to run the process as a different user (e.g. hive):

```
export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
export HADOOP_HEAPSIZE=8096
sudo -u hive hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.batch.hive.MetastoreReplicationJob --config-files my_config_file.xml --table-list my_tables_to_copy.txt
```
The batch job also accepts the optional arguments --step and --override-input. These are useful if you want to run one of the job's three MR stages individually for faster failure recovery: --step indicates which step to run, and --override-input provides the input path when running the second or third stage MR job. The input path will usually be the output of the first stage MR job (see the sketch below).
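For example, a hypothetical invocation that re-runs only the second stage might look like the following; the step numbering and the HDFS input path are assumptions, so substitute the output location of your own first-stage run:

```
# Re-run stage 2 only, reading the first stage's output instead of
# recomputing it (the path below is illustrative).
sudo -u hive hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.reair.batch.hive.MetastoreReplicationJob \
  --config-files my_config_file.xml \
  --table-list my_tables_to_copy.txt \
  --step 2 \
  --override-input /tmp/reair_batch/step1_output
```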
Incremental replication relies on recording changes in the source Hive warehouse to figure out what needs to be replicated. These changes can be recorded in two different ways. In the first method, a hook is added to the Hive CLI that runs after a query succeeds. In the other method, the hook is added as a listener in the Hive remote metastore server; this requires that you have the metastore server deployed and used by Hive, but it will work when systems other than Hive (e.g. Spark) make calls to the metastore server to create tables. The steps to deploy either hook are similar.

First, build the JAR containing the audit log hook:
```
cd reair
./gradlew shadowjar -p hive-hooks -x test
```
Once the build finishes, the hook JAR can be found at hive-hooks/build/libs/airbnb-reair-hive-hooks-1.0.0-all.jar. Deploy it to the cluster and add it to the Hive auxiliary JAR path via hive.aux.jars.path. If you're deploying the hook for the CLI, you only have to deploy the JAR on the hosts where the CLI will be run; likewise, if you're deploying the hook for the metastore server, you only have to deploy the JAR on the server host.

Next, update hive-site.xml: for the CLI hook, add the sections from the audit log configuration template after replacing the placeholders with appropriate values; for the metastore hook, use the metastore audit log configuration template instead.

Then create the audit_log and audit_objects tables in the DB that will store the audit log entries.
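To illustrate the CLI method, a post-execution hook can also be enabled for a single session from the command line instead of hive-site.xml. This is only a sketch: hive.exec.post.hooks is the standard Hive property for post-query hooks, but the hook class name below is an assumption, so take the real property names and values from the audit log configuration template:

```
# Hypothetical one-off session with the audit log hook enabled.
# The hook class name is an assumption; see the configuration template.
hive --hiveconf hive.aux.jars.path=file:///path/to/airbnb-reair-hive-hooks-1.0.0-all.jar \
     --hiveconf hive.exec.post.hooks=com.airbnb.reair.hive.hooks.CliAuditLogHook \
     -e "CREATE TABLE audit_hook_smoke_test (id INT);"
```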
With the hooks deployed, build the JAR for the incremental replication process (as with the earlier builds, the '-x test' flag skips the tests):

```
cd reair
./gradlew shadowjar -p main -x test
```
Once the build finishes, the JAR to run the incremental replication process can be found under main/build/libs/airbnb-reair-main-1.0.0-all.jar.

Launch the incremental replication process using the hadoop jar command on the destination cluster. An example log4j.properties file is provided here. Be sure to specify the configuration file that was filled out in the prior step. As with batch replication, you may need to run the process as a different user:

```
export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
sudo -u hive hadoop jar airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.incremental.deploy.ReplicationLauncher --config-files my_config_file.xml
```
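Because the process is intended to be long-running, you may want it to survive logout; one common approach (an operational suggestion, not a feature of the tool):

```
# Run the launcher in the background, immune to hangups, capturing
# stdout/stderr in a local file for later inspection.
nohup sudo -u hive hadoop jar airbnb-reair-main-1.0.0-all.jar \
  com.airbnb.reair.incremental.deploy.ReplicationLauncher \
  --config-files my_config_file.xml > reair_incremental.out 2>&1 &
```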
If you use the recommended log4j.properties file that is shipped with the tool, messages at the INFO level and above are printed to stderr, while more detailed messages (DEBUG level and above) are recorded to a log file in the current working directory.
When the incremental replication process is launched for the first time, it will start replicating entries after the highest numbered ID in the audit log. Because the process periodically checkpoints progress to the DB, it can be killed and will resume from where it left off when restarted. To override this behavior, please see the additional options section.
To force the process to start replicating entries after a particular audit log ID, you can pass the --start-after-id parameter:
```
export HADOOP_OPTS="-Dlog4j.configuration=file://<path to log4j.properties>"
hadoop jar main/build/libs/airbnb-reair-main-1.0.0-all.jar com.airbnb.reair.incremental.deploy.ReplicationLauncher --config-files my_config_file.xml --start-after-id 123456
```
Replication entries that were started but not completed on the last invocation will be marked as aborted when you use --start-after-id to restart the process.
The incremental replication process starts a Thrift server that can be used to get metrics and view progress. The Thrift definition is provided here. A simple web server that displays progress has been included in the web-server module. To run the web server:
```
cd reair
./gradlew shadowjar -p web-server -x test
```

Once the build finishes, the JAR can be found at web-server/build/libs/airbnb-reair-web-server-1.0.0-all.jar. Start the web server, pointing it at the Thrift server of the running replication process:

```
java -jar airbnb-reair-web-server-1.0.0-all.jar --thrift-host localhost --thrift-port 9996 --http-port 8080
```
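Assuming the defaults above, a quick smoke test from the same host:

```
# The web UI should respond on the port passed via --http-port.
curl -s http://localhost:8080/ | head -n 20
```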
Then navigate to http://localhost:8080 to view the active and retired replication jobs.

If you find ReAir useful, please list yourself on this page!