Upgrading from HDP 2.1 to HDP 2.2
Posted on 12-Jul-2015
Upgrading from HDP2.1 to HDP2.2
2014/12/18 @tagomoris
Hadoop SCR #hadoopreading
Satoshi Tagomori (@tagomoris), LINE Corp.
Hadoop Cluster Switching
Long running CDH4 cluster
Analysis2 (CDH4): MRv1/HDFS, Hive
Switching to new cluster
w/ Fast network, Large HDD, Many CPU cores
changing Hive table schema/file formats
No downtime!
Distribution Options
Options as of Oct 2014
CDH5
HDP2.1
Apache Hadoop Release
Hive 0.13, Tez -> HDP2.1 !
input data: fluent-plugin-webhdfs
Shib: executing queries over hiveserver1/2
Analysis2 (CDH4): MRv1/HDFS, Hive
double write
Shib
Analysis2 (CDH4): MRv1/HDFS, Hive
Analysis3 (HDP2.1): MRv2/HDFS, Hive
distcp
Nov-Dec 2014
HDP 2.1.5.0
Installed with Ansible, w/o Ambari
for configuration versioning
Hadoop 2.4.0
YARN RM-HA + Namenode HA
Hive 0.13
Tez?
Shib
Analysis2 (CDH4): MRv1/HDFS, Hive
Analysis3 (HDP2.1): MRv2/HDFS, Hive
A few days later (not yet)
HDP 2.2!
Hadoop 2.6.0
Datanode hot swap drive
YARN ResourceManager REST API
Hive 0.14.0 (...)
Latest Tez
diff HDP2.1 HDP2.2
hadoop-yarn-2.4.0.2.1.5.0-695.el6
-> hadoop-yarn-2.6.0.2.2.0.0-2041.el6
+ hadoop_2_2_0_0_2041-yarn-2.6.0.2.2.0.0-2041.el6
/usr/lib/hadoop/....
-> /usr/hdp/current/hadoop-*
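The package names above pack several versions into one string. A minimal Python sketch of how they could be split apart follows. It assumes the layout is name, then a three-part Hadoop version, then a four-part HDP version, then build number and distribution tag; that reading matches the "Hadoop 2.4.0 / HDP 2.1.5.0" and "Hadoop 2.6.0 / HDP 2.2" pairs on the slides, but it is an interpretation, and `parse_hdp_package` is a hypothetical helper, not part of any HDP tooling.

```python
import re
from typing import NamedTuple


class HdpPackage(NamedTuple):
    name: str               # e.g. "hadoop-yarn"
    component_version: str  # assumed: Apache Hadoop version, e.g. "2.6.0"
    hdp_version: str        # assumed: HDP version, e.g. "2.2.0.0"
    build: str              # e.g. "2041"
    dist: str               # e.g. "el6"


# Assumed layout: <name>-<x.y.z>.<a.b.c.d>-<build>.<dist>
_PATTERN = re.compile(
    r"^(?P<name>.+)-"
    r"(?P<component>\d+\.\d+\.\d+)\."
    r"(?P<hdp>\d+\.\d+\.\d+\.\d+)-"
    r"(?P<build>\d+)\.(?P<dist>\w+)$"
)


def parse_hdp_package(full_name: str) -> HdpPackage:
    """Split a package string from the slides into its version parts."""
    m = _PATTERN.match(full_name)
    if m is None:
        raise ValueError(f"unrecognized package name: {full_name!r}")
    return HdpPackage(m["name"], m["component"], m["hdp"], m["build"], m["dist"])


if __name__ == "__main__":
    for pkg in (
        "hadoop-yarn-2.4.0.2.1.5.0-695.el6",
        "hadoop-yarn-2.6.0.2.2.0.0-2041.el6",
        "hadoop_2_2_0_0_2041-yarn-2.6.0.2.2.0.0-2041.el6",
    ):
        print(parse_hdp_package(pkg))
```

Note that in the third name the HDP version and build also appear inside the package name itself (hadoop_2_2_0_0_2041-yarn), which is the "+" line in the diff above.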
diff HDP2.1 HDP2.2
Toooooooooooooo many diff lines
Companion files of HDP (2.1 -> 2.2)
in hive-site.xml: 353 -> 1207 lines
in tez-site.xml: 126 -> 261 lines
How to edit/control?
IDE? Editor? KIAI (sheer willpower)? Excel?
hadoop_xml_diff.rb
http://d.hatena.ne.jp/tagomoris/20141215/1418631988
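To illustrate the idea of comparing configuration files by property rather than by line, here is a minimal Python sketch. It is not the hadoop_xml_diff.rb script linked above (that one is Ruby and its behavior is not shown on the slides); it only assumes the usual Hadoop site-file layout of property elements with name and value children. The sample property values are made up for illustration.

```python
import xml.etree.ElementTree as ET
from typing import Dict, Tuple


def load_properties(xml_text: str) -> Dict[str, str]:
    """Collect name -> value pairs from a Hadoop-style *-site.xml document."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        if name is None:
            continue  # skip malformed entries without a name
        props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def diff_properties(old: Dict[str, str], new: Dict[str, str]) -> Tuple[dict, dict, dict]:
    """Return (added, removed, changed) between two property maps."""
    added = {k: new[k] for k in new.keys() - old.keys()}
    removed = {k: old[k] for k in old.keys() - new.keys()}
    changed = {
        k: (old[k], new[k])
        for k in old.keys() & new.keys()
        if old[k] != new[k]
    }
    return added, removed, changed


# Made-up sample content; real companion files are far larger.
OLD_XML = """<configuration>
  <property><name>hive.execution.engine</name><value>mr</value></property>
  <property><name>hive.exec.parallel</name><value>false</value></property>
</configuration>"""

NEW_XML = """<configuration>
  <property><name>hive.execution.engine</name><value>tez</value></property>
  <property><name>hive.vectorized.execution.enabled</name><value>true</value></property>
</configuration>"""

if __name__ == "__main__":
    added, removed, changed = diff_properties(
        load_properties(OLD_XML), load_properties(NEW_XML)
    )
    print("added:", added)
    print("removed:", removed)
    print("changed:", changed)
```

Comparing by property name keeps the result independent of ordering, comments, and whitespace, which is where most of the line-level diff noise in large site files tends to come from.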
Upgrade test in test cluster
Automated Upgrade by Ansible playbook
stop hiveserver2
stop cluster
  -safemode enter, -saveNamespace
  make backup (hdfs metadata, hive metastore)
  -finalizeUpgrade
  nm, rm, dn, nn, zkfc, jn, zk
  check edits stopped
Upgrade yum repo/packages/configurations
execute hdp-select
Start cluster
  zk, jn
  “hdfs namenode -upgrade”
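The ordering of the steps above matters (for example, -saveNamespace requires the NameNode to be in safe mode first). A minimal dry-run sketch in Python that only writes the order down and prints it is shown here. It executes nothing. The `hdfs dfsadmin` prefix is the standard CLI entry point for the three dash-options listed on the slide; steps for which the slides give no exact command are left as descriptions, and `Step`, `build_upgrade_plan` and `render` are hypothetical names, not part of the author's playbook.

```python
from typing import List, NamedTuple, Optional


class Step(NamedTuple):
    description: str
    command: Optional[str]  # None where the slides give no exact command


def build_upgrade_plan() -> List[Step]:
    """Ordered steps as listed on the slide; nothing is executed."""
    return [
        Step("stop hiveserver2", None),
        Step("enter safe mode", "hdfs dfsadmin -safemode enter"),
        Step("save namespace", "hdfs dfsadmin -saveNamespace"),
        Step("back up hdfs metadata and hive metastore", None),
        # -finalizeUpgrade finalizes a previous upgrade, if one is pending
        Step("finalize previous upgrade", "hdfs dfsadmin -finalizeUpgrade"),
        Step("stop nm, rm, dn, nn, zkfc, jn, zk", None),
        Step("check that edits have stopped", None),
        Step("upgrade yum repo/packages/configurations", None),
        Step("execute hdp-select", None),
        Step("start zk, jn", None),
        Step("start namenode with upgrade", "hdfs namenode -upgrade"),
    ]


def render(plan: List[Step]) -> List[str]:
    """Format the plan as numbered lines for review before a real run."""
    lines = []
    for i, step in enumerate(plan, start=1):
        suffix = f"  [{step.command}]" if step.command else ""
        lines.append(f"{i:2d}. {step.description}{suffix}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_upgrade_plan())))
```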
Upgrade in test cluster
Automated Upgrade by Ansible playbook
stop hiveserver2
stop cluster
  -safemode enter, -saveNamespace
  make backup (hdfs metadata, hive metastore)
  -finalizeUpgrade
  nm, rm, dn, nn, zkfc, jn, zk
  check edits stopped
Upgrade yum repo/packages/configurations
execute hdp-select
Start cluster
  zk, jn
  “hdfs namenode -upgrade” ... ever lasting ...
“Ah, I might have made some mistake...”
double write
Shib
Analysis2 (CDH4): MRv1/HDFS, Hive
Analysis3 (HDP2.2): MRv2/HDFS, Hive
Upgrade HDP 2.1->2.2
Dec 16 2014
Upgrade in analysis3
Manual Procedure!!!
stop hiveserver2
stop cluster
  -safemode enter, -saveNamespace
  make backup (hdfs metadata, hive metastore)
  -finalizeUpgrade
  nm, rm, dn, nn, zkfc, jn, zk
  check edits stopped
Upgrade yum repo/packages/configurations
execute hdp-select
Start cluster
  zk, jn
  /usr/hdp/current/hadoop-client/sbin/hadoop-daemon.sh \
    start namenode -upgrade
2014-12-16 14:53:28,919 INFO namenode.NNUpgradeUtil (NNUpgradeUtil.java:doUpgrade(139)) - Performing upgrade of storage directory /var/hadoop/hdfs/nn
2014-12-16 14:53:28,939 INFO namenode.FSNamesystem (FSNamesystem.java:loadFSImage(1029)) - Need to save fs image? false (staleImage=false, haEnabled=true, isRollingUpgrade=false)
2014-12-16 14:53:28,941 INFO namenode.FSEditLog (FSEditLog.java:startLogSegment(1173)) - Starting log segment at 262795139
2014-12-16 14:53:29,224 INFO namenode.NameCache (NameCache.java:initialized(143)) - initialized with 23408 entries 1740524 lookups
2014-12-16 14:53:29,227 INFO namenode.FSNamesystem (FSNamesystem.java:loadFromDisk(748)) - Finished loading FSImage in 15695 msecs
2014-12-16 14:53:29,346 INFO namenode.NameNode (NameNodeRpcServer.java:<init>(329)) - RPC server is binding to master1.local:8020
2014-12-16 14:53:29,348 INFO ipc.CallQueueManager (CallQueueManager.java:<init>(53)) - Using callQueue class java.util.concurrent.LinkedBlockingQueue
2014-12-16 14:53:29,390 INFO ipc.Server (Server.java:run(827)) - IPC Server Responder: starting
2014-12-16 14:53:29,390 INFO ipc.Server (Server.java:run(674)) - IPC Server listener on 8020: starting
2014-12-16 14:53:29,393 INFO namenode.NameNode (NameNode.java:startCommonServices(646)) - NameNode RPC up at: master1.local/10.0.0.0:8020
2014-12-16 14:53:29,393 INFO namenode.FSNamesystem (FSNamesystem.java:startActiveServices(1142)) - Starting services required for active state
2014-12-16 14:53:29,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(160)) - Starting CacheReplicationMonitor with interval 30000 milliseconds
2014-12-16 14:53:29,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 13919829439 milliseconds
2014-12-16 14:53:29,576 INFO fs.TrashPolicyDefault (TrashPolicyDefault.java:initialize(92)) - Namenode trash configuration: Deletion interval = 360 minutes, Emptier interval = 0 minutes.
2014-12-16 14:53:29,576 INFO fs.TrashPolicyDefault (TrashPolicyDefault.java:<init>(247)) - The configured checkpoint interval is 0 minutes. Using an interval of 360 minutes that is used for deletion instead
2014-12-16 14:53:29,584 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 189 millisecond(s).
2014-12-16 14:53:59,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:53:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:54:29,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:54:29,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 2 millisecond(s).
2014-12-16 14:54:59,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:54:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:55:29,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:55:29,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:55:59,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30001 milliseconds
2014-12-16 14:55:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:56:29,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:56:29,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:56:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:56:59,398 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
2014-12-16 14:57:29,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:57:29,398 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 2 millisecond(s).
2014-12-16 14:57:59,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds
2014-12-16 14:57:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).
(ever lasting...)
https://gist.github.com/tagomoris/ed7aa8ccb3d6003a29f9
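In the log above, the only messages after 14:53:29 come from the periodic CacheReplicationMonitor scan, every 30 seconds. A minimal Python sketch that measures this "quiet period" from a NameNode log is shown below. It assumes the line layout seen in the excerpt (timestamp, level, logger name) and treats CacheReplicationMonitor lines as idle noise, which is an assumption made for illustration; `quiet_period` is a hypothetical helper, not a Hadoop tool.

```python
import re
from datetime import datetime, timedelta
from typing import Iterable, Optional, Tuple

# Assumed layout: "2014-12-16 14:53:29,390 INFO ipc.Server (...) - message"
LINE_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2},\d{3}) "
    r"(?P<level>\w+) (?P<logger>[\w.]+) "
)

IDLE_LOGGER = "blockmanagement.CacheReplicationMonitor"


def parse_line(line: str) -> Optional[Tuple[datetime, str]]:
    """Return (timestamp, logger) for a log line, or None if it does not match."""
    m = LINE_RE.match(line)
    if m is None:
        return None
    ts = datetime.strptime(m["ts"], "%Y-%m-%d %H:%M:%S,%f")
    return ts, m["logger"]


def quiet_period(lines: Iterable[str], idle_logger: str = IDLE_LOGGER) -> Optional[timedelta]:
    """Time between the last non-idle message and the last message overall."""
    last_progress = None
    last_seen = None
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue
        ts, logger = parsed
        last_seen = ts
        if logger != idle_logger:
            last_progress = ts
    if last_progress is None or last_seen is None:
        return None
    return last_seen - last_progress


# Three lines taken from the log excerpt above.
SAMPLE = [
    "2014-12-16 14:53:29,390 INFO ipc.Server (Server.java:run(827)) - IPC Server Responder: starting",
    "2014-12-16 14:53:59,396 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(178)) - Rescanning after 30000 milliseconds",
    "2014-12-16 14:57:59,397 INFO blockmanagement.CacheReplicationMonitor (CacheReplicationMonitor.java:run(201)) - Scanned 0 directive(s) and 0 block(s) in 1 millisecond(s).",
]

if __name__ == "__main__":
    print(quiet_period(SAMPLE))
```

A long quiet period does not prove that the upgrade is stuck, since the NameNode may simply not log its progress, so it is best read together with other signals.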
ever lasting!!!!!!!!
${dfs.namenode.name.dir}/current and .../previous have not been modified for 60 minutes...
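The check described here, whether anything under the NameNode metadata directory is still being written, can be sketched in Python as follows. This is a minimal illustration under the assumption that file modification times are a usable signal; the directory path is passed in by the caller, and `minutes_since_last_change` is a hypothetical helper name.

```python
import os
import time
from typing import Optional


def newest_mtime(root: str) -> float:
    """Most recent modification time of root or anything beneath it."""
    newest = os.stat(root).st_mtime
    for dirpath, dirnames, filenames in os.walk(root):
        for name in dirnames + filenames:
            try:
                mtime = os.stat(os.path.join(dirpath, name)).st_mtime
            except FileNotFoundError:
                continue  # entry vanished between listing and stat
            newest = max(newest, mtime)
    return newest


def minutes_since_last_change(root: str, now: Optional[float] = None) -> float:
    """Minutes elapsed since the last modification under root."""
    if now is None:
        now = time.time()
    return (now - newest_mtime(root)) / 60.0


if __name__ == "__main__":
    import sys

    # Usage: python check_mtime.py <value of dfs.namenode.name.dir>
    print(f"{minutes_since_last_change(sys.argv[1]):.1f} minutes since last change")
```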
rollback
stop all daemons
replace all packages w/ HDP2.1
replace configurations for HDP2.1
/usr/lib/hadoop/sbin/hadoop-daemon.sh --config ... start namenode -rollback
$ /usr/lib/hadoop/sbin/hadoop-daemon.sh --config /etc/hadoop/conf start namenode -rollback
starting namenode, logging to /var/log/hadoop/hdfs/hadoop-hdfs-namenode-4c3bf0834.livedoor.out
"rollBack" will remove the current state of the file system,
returning you to the state prior to initiating your recent.
upgrade. This action is permanent and cannot be undone. If you
are performing a rollback in an HA environment, you should be
certain that no NameNode process is running on any host.
Roll back file system state? (Y or N) Invalid input:
Roll back file system state? (Y or N) Invalid input:
Roll back file system state? (Y or N) Invalid input:
Roll back file system state? (Y or N) Invalid input:
Roll back file system state? (Y or N) Invalid input:
Roll back file system state? (Y or N) Invalid input:
$
impossible
I cannot input any “Y”s...
Recovery
replace namenode metadata w/ backup
execute NameNode (HDP 2.1) & DataNode
cluster recovered!!!!
Replication counts of all blocks are ZERO!!!!!!!1!!!!1!
hadoop fsck / -> all blocks became fine!
Conclusion
I will wait for someone else to run HDP 2.2 first...