gluster replication peer failure

Just a quick post, because I couldn’t find anything on this in a Google search.

I was in the process of migrating data in a replicated volume from one machine to another when the destination machine was interrupted (it was actually rebooted by an automated process kicked off by another admin; that’s what poor communication gets you). After that, the destination machine wouldn’t boot: it mounted several gluster volumes from localhost, glusterd wouldn’t start, and the boot process hung waiting on those mounts.

So, one lesson is that mounting from the load balancer address might be better than localhost. :)
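For example, a mount that goes through a floating address, falls back to other peers for the volfile, and doesn’t hang the boot if it fails could look something like this in /etc/fstab (the hostnames here are made up, and nofail and backup-volfile-servers depend on your distro and gluster version):

gluster-vip:/sec_backup /mnt/sec_backup glusterfs defaults,_netdev,nofail,backup-volfile-servers=node2:node3 0 0

The nofail option keeps a failed mount from hanging the boot, and backup-volfile-servers gives the client other peers to fetch the volfile from if the first one is down.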

Anyway, the error I was getting was that a path under /var/lib/glusterd didn’t exist. Specifically, I saw this at the end of /var/log/glusterfs/etc-glusterfs-glusterd.vol.log:

[2016-03-02 15:34:16.093779] I [MSGID: 106513] [glusterd-store.c:2047:glusterd_restore_op_version] 0-glusterd: retrieved op-version: 30706
[2016-03-02 15:34:16.500257] E [MSGID: 101032] [store.c:434:gf_store_handle_retrieve] 0-: Path corresponding to /var/lib/glusterd/vols/sec_backup/bricks/newhostname:-srv-gluster-bricks-sec_backup-brick1-data. [No such file or directory]
[2016-03-02 15:34:16.500312] E [MSGID: 106201] [glusterd-store.c:3042:glusterd_store_retrieve_volumes] 0-management: Unable to restore volume: sec_backup
[2016-03-02 15:34:16.500357] E [MSGID: 101019] [xlator.c:428:xlator_init] 0-management: Initialization of volume 'management' failed, review your volfile again
[2016-03-02 15:34:16.500374] E [graph.c:322:glusterfs_graph_init] 0-management: initializing translator failed
[2016-03-02 15:34:16.500383] E [graph.c:661:glusterfs_graph_activate] 0-graph: init failed
[2016-03-02 15:34:16.500979] W [glusterfsd.c:1236:cleanup_and_exit] (-->/usr/sbin/glusterd(glusterfs_volumes_init+0xda) [0x405cba] -->/usr/sbin/glusterd(glusterfs_process_volfp+0x116) [0x405b96] -->/usr/sbin/glusterd(cleanup_and_exit+0x65) [0x4059d5] ) 0-: received signum (0), shutting down

Apparently the volume’s metadata on the new machine had been updated to point at the new brick, but the brick’s info file under /var/lib/glusterd/vols/sec_backup/bricks/ hadn’t been written yet. The rest of the nodes in the pool still reflected the old location, and the volume itself still worked fine. But glusterd on the new machine just refused to start.
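You can confirm this state by comparing the volume’s metadata directory against the same directory on a healthy peer:

ls /var/lib/glusterd/vols/sec_backup/bricks/

glusterd keeps one info file per brick in there, and on my broken machine the file named in the log above was the one missing.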

As it turns out, this was really simple to fix. The volume in question was sec_backup, so I just did:

sudo rm -fr /var/lib/glusterd/vols/sec_backup
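(If rm -fr on gluster metadata makes you nervous, moving the directory somewhere outside /var/lib/glusterd/vols should work just as well and is reversible; the destination here is arbitrary:

sudo mv /var/lib/glusterd/vols/sec_backup /root/sec_backup.glusterd.bak

Just don’t park the copy inside the vols directory itself, since glusterd will try to restore anything it finds there.)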

When glusterd started back up, it recreated the directory from the copy on one of the other nodes, and all was fine.
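For the record, bringing it back and checking on the volume was just the usual:

sudo systemctl start glusterd
sudo gluster volume info sec_backup

(Substitute service glusterd start if you’re not on systemd. The second command only works once glusterd is up, which made it a decent sanity check here.)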