Recently, we experienced a GPFS fault caused by some sort of LSI9285-8e glitch that happened during a regular patrol read. We have not been able to reproduce the fault. This is what the syslog messages from GPFS looked like:
Sep 2 20:17:31 foo01 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=21232\
32: Unrecoverable file system operation error. Status code 19. Volume foodata1
Sep 2 20:21:39 foo01 mmfs: Error=MMFS_DISKFAIL, ID=0x9C6C05FA, Tag=2123233: \
Disk failure. Volume decdata1. rc = 19. Physical volume nsd1
Sep 2 20:21:39 foo01 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=21232\
34: Unrecoverable file system operation error. Status code 19. Volume foodata1
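To pull just these events out of a busy syslog, a small filter along the following lines may help. It is a minimal sketch: gpfs_errors is a hypothetical helper name, it assumes the traditional syslog layout shown above (month, day, time, host, then "mmfs: Error=..."), and the log file path in the usage comment is an assumption that varies by distribution.

```shell
#!/bin/sh
# gpfs_errors (hypothetical helper): read syslog text on stdin and print
# "month day time host ERROR_NAME" for every GPFS error event.
# Assumes the syslog line layout shown in the excerpt above.
gpfs_errors() {
    awk '/ mmfs: Error=/ {
        for (i = 1; i <= NF; i++) {
            if ($i ~ /^Error=/) {
                name = $i
                sub(/^Error=/, "", name)   # drop the "Error=" prefix
                sub(/,$/, "", name)        # drop the trailing comma
                print $1, $2, $3, $4, name
                break
            }
        }
    }'
}

# Usage (log path is an assumption):
#   gpfs_errors < /var/log/messages
```

For the excerpt above this would list one MMFS_DISKFAIL event between two MMFS_SYSTEM_UNMOUNT events.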
As you can see, nsd1 is not available:
[root@foonsd1 ~]# mmlsdisk foodata1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
nsd1         nsd         512       1 Yes      Yes   ready         down         system
nsd2         nsd         512       2 Yes      Yes   ready         up           system
nsd3         nsd         512       1 Yes      Yes   ready         up           system
nsd4         nsd         512       2 Yes      Yes   ready         up           system
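On a filesystem with many NSDs it is easy to miss a single "down" entry in a listing like this. As a rough sketch, the listing can be filtered with awk. The helper name down_disks is hypothetical, and the field positions (driver type in field 2, availability in field 8) are assumptions taken from the output shown here, so check them against your GPFS release.

```shell
#!/bin/sh
# down_disks (hypothetical helper): read `mmlsdisk <fs>` output on stdin and
# print the name of every disk whose availability is not "up".
# Assumes the column layout shown above: field 2 = driver type, field 8 = availability.
down_disks() {
    # $2 == "nsd" skips the two header lines and the dashed separator.
    awk 'NF >= 9 && $2 == "nsd" && $8 != "up" { print $1 }'
}

# Usage:
#   mmlsdisk foodata1 | down_disks
```

Run against the listing above, this prints only nsd1.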
mmchdisk needs to be run to re-enable the downed disk. This operation is functionally similar to mounting a non-distributed filesystem that was not unmounted cleanly.
[root@foonsd1 log]# mmchdisk foodata1 start -d nsd1
Scanning file system metadata, phase 1 ...
81 % complete on Tue Sep 4 10:17:53 2012
100 % complete on Tue Sep 4 10:17:54 2012
Scan completed successfully.
Scanning file system metadata, phase 2 ...
Scan completed successfully.
Scanning file system metadata, phase 3 ...
Scan completed successfully.
Scanning file system metadata, phase 4 ...
Scan completed successfully.
Scanning user file metadata ...
100.00 % complete on Tue Sep 4 10:18:03 2012
Scan completed successfully.
Now we want to fsck the entire filesystem with mmfsck. Note that the -t argument is a path for temporary working files. Obviously, this can't be on the filesystem you're fscking.
[root@foonsd1 log]# mmfsck foodata1 -v -o -t /home/gpfs/
Checking "foodata1"
fsckFlags 0x18
needNewLogs 0
nThreads 8
clientTerm 0
fsckReady 1
fsckCreated 0
% pool allowed 50
tuner off
threshold 0.20
Disks 4
Bytes per subblock 131072 131072
Sectors per subblock 256 1654712940
Sectors per indirect block 64
Subblocks per block 32
Subblocks per indirect block 1
Inodes 7372800
Inode size 512
singleINum -1
Inode regions 131
maxInodesPerSegment 522240
Segments per inode region 1
Bytes per inode segment 4194304
nInode0Files 1
Memory available per pass 4214505436
Regions per pass of pool system 1124
fsckStatus 2
lrOwned -1
hrOwned -1
PA size 0
PA map size 0
PA OptimalInodes 0
Inodes per inode block 8192
Data ptrs per inode 16
Indirect ptrs per inode 16
Data ptrs per indirect 1363
User files exposed some
Meta files exposed some
User files ill replicated some
Meta files ill replicated some
User files unbalanced all
Meta files unbalanced all
Current snapshots 0
Max snapshots 256
checkFilesets 1
checkFilesetsV2 1
Worker node 0
Checking inodes
Regions 0 to 1123 of total 1124 in storage pool "system".
Node x.x.27.29 (foo09) starting inode scan 0 to 65535
[lots more output about inode scanning...]
Lost blocks were found.
Correct the allocation map? y
292765696 subblocks
62243195 allocated
32010 unreferenced
32010 deallocated
2531993 addresses
0 suspended
File system is clean.
Exit status 0:10:0.
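The restriction on the -t directory can be checked before starting mmfsck. The following is a minimal sketch, not part of GPFS: mount_point_of and check_fsck_tmpdir are hypothetical helpers, the comparison relies on the "Mounted on" column of POSIX `df -P`, and the GPFS mount point has to be supplied by the caller (the /foodata1 path in the usage comment is an assumption). Mount points containing spaces are not handled.

```shell
#!/bin/sh
# mount_point_of (hypothetical helper): print the mount point that holds a path,
# taken from the last column of `df -P` output.
mount_point_of() {
    df -P "$1" 2>/dev/null | awk 'NR == 2 { print $NF }'
}

# check_fsck_tmpdir (hypothetical helper):
#   $1 = directory intended for `mmfsck -t`
#   $2 = mount point of the filesystem being checked
# Returns 0 if the directory is elsewhere, 1 if it is on that filesystem,
# 2 if the directory cannot be examined.
check_fsck_tmpdir() {
    tmpdir=$1
    fs_mount=$2
    tmp_mp=$(mount_point_of "$tmpdir")
    if [ -z "$tmp_mp" ]; then
        echo "cannot determine filesystem of $tmpdir" >&2
        return 2
    fi
    if [ "$tmp_mp" = "$fs_mount" ]; then
        echo "$tmpdir is on $fs_mount; choose a different directory" >&2
        return 1
    fi
    return 0
}

# Usage (the mount point /foodata1 is an assumption):
#   check_fsck_tmpdir /home/gpfs/ /foodata1 && mmfsck foodata1 -v -o -t /home/gpfs/
```

The temporary files can be large on a big filesystem, so it may also be worth checking free space in that directory first.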
And now we're ready to remount the filesystem.
[root@foonsd1 log]# mmlsdisk foodata1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
nsd1         nsd         512       1 Yes      Yes   ready         up           system
nsd2         nsd         512       2 Yes      Yes   ready         up           system
nsd3         nsd         512       1 Yes      Yes   ready         up           system
nsd4         nsd         512       2 Yes      Yes   ready         up           system
[root@foonsd1 log]# mmmount all -a
Tue Sep 4 10:21:19 MST 2012: mmmount: Mounting file systems ...
[root@foonsd1 log]# mmlsmount all
File system foodata1 is mounted on 18 nodes.
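For reference, the whole sequence used in this post can be collected into one script. This is only a sketch under stated assumptions: recover_gpfs and run are hypothetical helpers, the commands and flags are exactly the ones shown above, DRY_RUN defaults to 1 so that nothing is executed unless you opt in, and stopping on the first non-zero exit status is an assumption about how you would want failures handled. mmfsck may still prompt interactively, as it did above.

```shell
#!/bin/sh
# With DRY_RUN=1 (the default) commands are printed, not executed.
DRY_RUN=${DRY_RUN:-1}

# run (hypothetical helper): print the command in dry-run mode, else execute it.
run() {
    if [ "$DRY_RUN" = 1 ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# recover_gpfs (hypothetical helper):
#   $1 = filesystem, $2 = down disk, $3 = temp directory for mmfsck -t
recover_gpfs() {
    fs=$1
    disk=$2
    tmpdir=$3
    run mmchdisk "$fs" start -d "$disk" || return 1   # re-enable the down disk
    run mmfsck "$fs" -v -o -t "$tmpdir" || return 1   # check the filesystem
    run mmlsdisk "$fs" || return 1                    # confirm availability is "up"
    run mmmount all -a || return 1                    # mount on all nodes
    run mmlsmount all                                 # confirm the mount count
}

# Usage:
#   recover_gpfs foodata1 nsd1 /home/gpfs/              # prints the plan
#   DRY_RUN=0; recover_gpfs foodata1 nsd1 /home/gpfs/   # actually runs it
```

Reviewing the printed plan first keeps the manual, step-by-step character of the procedure while still recording the order of operations.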