In production we ran into the following messages in the router log:
2025-04-23 17:44:15,780 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router1:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn1:nn1:8020-EXPIRED
2025-04-23 17:44:15,781 INFO store.CachedRecordStore (CachedRecordStore.java:overrideExpiredRecords(192)) - Override State Store record MembershipState: router2:8888->hh-fed-sub25:nn2:nn2:8020-EXPIRED
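The cause: the subcluster was originally configured with 3 routers and 2 NameNodes, so 6 MembershipState records were stored in the StateStore (one per router and NameNode pair).
Later, two of the subcluster's routers were stopped and only one was left running. As a result, the messages above show up in the log of the router that is still running.
The router periodically loads the MembershipState records and checks each one for expiry. The two stopped Routers had already formed memberships with the NameNodes and reported them to the StateStore, so their records are still there but no longer refreshed. On top of that, we had disabled the parameter that deletes expired records, dfs.federation.router.store.membership.expiration.deletion. The running Router therefore keeps finding those expired records and prints the messages above.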
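The record count can be checked with a small sketch. The router and NameNode names are taken from the log where available; `router3` is a hypothetical name for the router that stayed up, and the key format only imitates the log output, it is not the real MembershipState key.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class MembershipCount {
    // Builds one key per (router, NameNode) pair.
    // The format only imitates the log lines; it is not the real State Store key.
    public static List<String> memberships(List<String> routers, List<String> namenodes) {
        List<String> keys = new ArrayList<>();
        for (String router : routers) {
            for (String nn : namenodes) {
                keys.add(router + ":8888->hh-fed-sub25:" + nn);
            }
        }
        return keys;
    }

    // Records reported by a router that is no longer running stop being refreshed.
    public static List<String> stale(List<String> keys, Set<String> runningRouters) {
        List<String> result = new ArrayList<>();
        for (String key : keys) {
            String router = key.substring(0, key.indexOf(':'));
            if (!runningRouters.contains(router)) {
                result.add(key);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // "router3" is a hypothetical name for the router that kept running.
        List<String> all = memberships(
                List.of("router1", "router2", "router3"), List.of("nn1", "nn2"));
        List<String> staleKeys = stale(all, Set.of("router3"));
        System.out.println(all.size() + " records, " + staleKeys.size() + " stale");
        staleKeys.forEach(System.out::println);
    }
}
```

The four stale keys correspond to the four log lines above: two stopped routers times two NameNodes.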
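Fix: either of the following works:
- Enable deletion of expired records
  - dfs.federation.router.store.membership.expiration defaults to 5min. If dfs.federation.router.store.membership.expiration.deletion is set to 2min, a membership that has expired (no report for more than 5min) is deleted once a further 2min have passed.
- Start the stopped routers again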
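To make the timing concrete, here is a small sketch of the arithmetic, assuming the 5min expiration and 2min deletion values from the list above. The class name, method names, and timestamp are made up for illustration and are not part of Hadoop.

```java
import java.time.Duration;
import java.time.Instant;

public class MembershipTimeline {
    // A record counts as expired once lastReport + expiration has passed.
    public static Instant expiresAt(Instant lastReport, Duration expiration) {
        return lastReport.plus(expiration);
    }

    // A record becomes eligible for deletion once a further `deletion`
    // interval has passed after the expiry point.
    public static Instant deletableAt(Instant lastReport, Duration expiration,
                                      Duration deletion) {
        return expiresAt(lastReport, expiration).plus(deletion);
    }

    public static void main(String[] args) {
        Instant lastReport = Instant.parse("2025-04-23T09:00:00Z"); // illustrative only
        Duration expiration = Duration.ofMinutes(5); // membership.expiration
        Duration deletion = Duration.ofMinutes(2);   // membership.expiration.deletion

        System.out.println("last report:  " + lastReport);
        System.out.println("expired from: " + expiresAt(lastReport, expiration));
        System.out.println("deletable from: "
                + deletableAt(lastReport, expiration, deletion));
    }
}
```

In other words, a record from a stopped router would be eligible for removal about 7min after its last report. The actual removal presumably happens on the first check that runs after that point.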
Reference source code
org.apache.hadoop.hdfs.server.federation.store.CachedRecordStore#overrideExpiredRecords
public void overrideExpiredRecords(QueryResult<R> query) throws IOException {
  List<R> commitRecords = new ArrayList<>();
  List<R> deleteRecords = new ArrayList<>();
  List<R> newRecords = query.getRecords();
  long currentDriverTime = query.getTimestamp();
  if (newRecords == null || currentDriverTime <= 0) {
    LOG.error("Cannot check overrides for record");
    return;
  }
  for (R record : newRecords) {
    if (record.shouldBeDeleted(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      if (getDriver().remove(record)) {
        deleteRecords.add(record);
        LOG.info("Deleted State Store record {}: {}", recordName, record);
      } else {
        LOG.warn("Couldn't delete State Store record {}: {}", recordName,
            record);
      }
    } else if (record.checkExpired(currentDriverTime)) {
      String recordName = StateStoreUtils.getRecordName(record.getClass());
      LOG.info("Override State Store record {}: {}", recordName, record);
      commitRecords.add(record);
    }
  }
  if (commitRecords.size() > 0) {
    getDriver().putAll(commitRecords, true, false);
  }
  if (deleteRecords.size() > 0) {
    newRecords.removeAll(deleteRecords);
  }
}
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#checkExpired
// Override in the MembershipState subclass: marks the record as EXPIRED.
@Override
public boolean checkExpired(long currentTime) {
  if (super.checkExpired(currentTime)) {
    this.setState(EXPIRED);
    // Commit it
    return true;
  }
  return false;
}

// Base implementation in BaseRecord: expired once modifiedTime + expiration has passed.
public boolean checkExpired(long currentTime) {
  long expiration = getExpirationMs();
  long modifiedTime = getDateModified();
  if (modifiedTime > 0 && expiration > 0) {
    return (modifiedTime + expiration) < currentTime;
  }
  return false;
}
org.apache.hadoop.hdfs.server.federation.store.records.BaseRecord#shouldBeDeleted
public boolean shouldBeDeleted(long currentTime) {
  long deletionTime = getDeletionMs();
  if (isExpired() && deletionTime > 0) {
    long elapsedTime = currentTime - (getDateModified() + getExpirationMs());
    return elapsedTime > deletionTime;
  } else {
    return false;
  }
}
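Putting the three methods together, the following standalone sketch mimics the per-record decision. It is a simplified model and not the Hadoop classes: `SimpleRecord`, `Action`, and `classify` are hypothetical names, a non-positive deletion value stands in for deletion being disabled (matching the `deletionTime > 0` check above), and the modification time is assumed never to change once the reporting router is stopped.

```java
public class ExpiryDecision {
    public enum Action { KEEP, OVERRIDE, DELETE }

    // Simplified stand-in for a State Store record; not the Hadoop BaseRecord.
    public static final class SimpleRecord {
        final long dateModified;  // ms, time of the last report
        final long expirationMs;  // ms without a report before the record expires
        final long deletionMs;    // ms after expiry before deletion; <= 0 disables it
        boolean expired;          // set once checkExpired() has fired

        public SimpleRecord(long dateModified, long expirationMs, long deletionMs) {
            this.dateModified = dateModified;
            this.expirationMs = expirationMs;
            this.deletionMs = deletionMs;
        }

        boolean checkExpired(long now) {
            if (dateModified > 0 && expirationMs > 0
                    && dateModified + expirationMs < now) {
                expired = true; // mirrors setState(EXPIRED) in the override
                return true;
            }
            return false;
        }

        boolean shouldBeDeleted(long now) {
            if (expired && deletionMs > 0) {
                return now - (dateModified + expirationMs) > deletionMs;
            }
            return false;
        }
    }

    // Same branch order as overrideExpiredRecords: deletion first, then expiry.
    public static Action classify(SimpleRecord record, long now) {
        if (record.shouldBeDeleted(now)) {
            return Action.DELETE;
        }
        if (record.checkExpired(now)) {
            return Action.OVERRIDE; // the branch that logs "Override State Store record"
        }
        return Action.KEEP;
    }

    public static void main(String[] args) {
        long minute = 60_000L;
        long lastReport = 1_000L;

        SimpleRecord noDeletion = new SimpleRecord(lastReport, 5 * minute, -1);
        SimpleRecord withDeletion = new SimpleRecord(lastReport, 5 * minute, 2 * minute);

        for (long elapsed : new long[] {4, 6, 8, 30}) {
            long now = lastReport + elapsed * minute;
            System.out.println("+" + elapsed + "min deletion off: "
                    + classify(noDeletion, now)
                    + ", deletion 2min: " + classify(withDeletion, now));
        }
    }
}
```

In this model, a record with deletion disabled lands in the OVERRIDE branch on every check after it expires, which is consistent with the message repeating in the running Router's log. With a positive deletion value it reaches the DELETE branch instead once the extra interval has passed.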