Recently, we experienced a GPFS fault caused by some sort of LSI9285-8e glitch that happened during a regular patrol read. The fault has not been reproducible. This is what the syslog messages from GPFS look like:
    Sep 2 20:17:31 foo01 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=2123232: Unrecoverable file system operation error. Status code 19. Volume foodata1
    Sep 2 20:21:39 foo01 mmfs: Error=MMFS_DISKFAIL, ID=0x9C6C05FA, Tag=2123233: Disk failure. Volume decdata1. rc = 19. Physical volume nsd1
    Sep 2 20:21:39 foo01 mmfs: Error=MMFS_SYSTEM_UNMOUNT, ID=0xC954F85D, Tag=2123234: Unrecoverable file system operation error. Status code 19. Volume foodata1
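If you want to watch for these events from a script, the messages are regular enough to pick apart with a regular expression. The following is a minimal sketch in Python, assuming each message sits on a single line with the wrapped Tag value rejoined. The names `MMFS_RE` and `parse_mmfs_line` are hypothetical helpers of mine, not part of GPFS.

```python
import re

# Matches the "mmfs:" portion of the syslog lines shown above.
# Assumption: one message per line, with the Tag value not wrapped.
MMFS_RE = re.compile(
    r"mmfs: Error=(?P<error>\w+), ID=(?P<id>0x[0-9A-Fa-f]+), "
    r"Tag=(?P<tag>\d+):\s*(?P<text>.*)"
)


def parse_mmfs_line(line):
    """Return a dict of fields from one mmfs syslog line, or None if it does not match."""
    m = MMFS_RE.search(line)
    if m is None:
        return None
    event = m.groupdict()
    # The file system name follows the capitalised word "Volume".
    vol = re.search(r"Volume (\S+)", event["text"])
    event["volume"] = vol.group(1).rstrip(".") if vol else None
    # Disk failure messages also name the NSD after "Physical volume".
    disk = re.search(r"Physical volume (\S+)", event["text"])
    event["disk"] = disk.group(1).rstrip(".") if disk else None
    return event


sample = (
    "Sep 2 20:21:39 foo01 mmfs: Error=MMFS_DISKFAIL, ID=0x9C6C05FA, "
    "Tag=2123233: Disk failure. Volume decdata1. rc = 19. Physical volume nsd1"
)
event = parse_mmfs_line(sample)
print(event["error"], event["volume"], event["disk"])
```

Filtering on `event["error"] == "MMFS_DISKFAIL"` gives you the name of the NSD to look at in the next step.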
As you can see, nsd1 is not available:
    [root@foonsd1 ~]# mmlsdisk foodata1
    disk         driver   sector failure holds    holds                            storage
    name         type       size   group metadata data  status        availability pool
    ------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
    nsd1         nsd         512       1 Yes      Yes   ready         down         system
    nsd2         nsd         512       2 Yes      Yes   ready         up           system
    nsd3         nsd         512       1 Yes      Yes   ready         up           system
    nsd4         nsd         512       2 Yes      Yes   ready         up           system
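Spotting a down disk by eye is fine for four NSDs, but the same check is easy to script. Here is a rough sketch in Python, assuming the nine-column layout shown above and single-word values in the status column (a multi-word status would shift the columns). `down_disks` is a hypothetical helper, not a GPFS tool.

```python
def down_disks(mmlsdisk_output):
    """Return the names of disks whose availability column is not 'up'."""
    down = []
    in_table = False
    for line in mmlsdisk_output.splitlines():
        fields = line.split()
        if not fields:
            continue
        # Data rows begin after the dashed separator under the header.
        if set(fields[0]) == {"-"}:
            in_table = True
            continue
        # Assumed columns: name, driver, sector size, failure group,
        # holds metadata, holds data, status, availability, pool.
        if in_table and len(fields) >= 9:
            name, availability = fields[0], fields[7]
            if availability != "up":
                down.append(name)
    return down


sample = """\
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
nsd1         nsd         512       1 Yes      Yes   ready         down         system
nsd2         nsd         512       2 Yes      Yes   ready         up           system
nsd3         nsd         512       1 Yes      Yes   ready         up           system
nsd4         nsd         512       2 Yes      Yes   ready         up           system
"""
print(down_disks(sample))
```

On a real node you would feed it the captured output of `mmlsdisk foodata1` instead of the sample string.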
mmchdisk needs to be run to re-enable the downed disk. This operation is functionally similar to mounting a non-distributed filesystem that was not unmounted cleanly.
    [root@foonsd1 log]# mmchdisk foodata1 start -d nsd1
    Scanning file system metadata, phase 1 ...
      81 % complete on Tue Sep 4 10:17:53 2012
     100 % complete on Tue Sep 4 10:17:54 2012
    Scan completed successfully.
    Scanning file system metadata, phase 2 ...
    Scan completed successfully.
    Scanning file system metadata, phase 3 ...
    Scan completed successfully.
    Scanning file system metadata, phase 4 ...
    Scan completed successfully.
    Scanning user file metadata ...
     100.00 % complete on Tue Sep 4 10:18:03 2012
    Scan completed successfully.
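If more than one disk is down, the same command can start them together. The sketch below only builds the argument list and does not run anything. It assumes, from my reading of the mmchdisk syntax, that several disk names go into a single `-d` argument separated by semicolons, so check the man page for your release before relying on it. `mmchdisk_start_command` is a hypothetical helper.

```python
def mmchdisk_start_command(device, disks):
    """Build the argument list for 'mmchdisk <device> start -d <disks>'."""
    if not disks:
        raise ValueError("no disks given")
    # Assumption: multiple disk names are joined with ';' in one -d argument.
    return ["mmchdisk", device, "start", "-d", ";".join(disks)]


cmd = mmchdisk_start_command("foodata1", ["nsd1"])
print(" ".join(cmd))
```

Passing the list to `subprocess.run(cmd, check=True)` on an NSD server avoids shell quoting problems with the semicolons.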
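Now we want to fsck the filesystem with mmfsck. Note that the -t argument is a path for temporary working files. Obviously, this can’t be on the filesystem you’re fscking. Also note that -o runs mmfsck in online mode, which recovers lost blocks but does not perform a full consistency check.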
    [root@foonsd1 log]# mmfsck foodata1 -v -o -t /home/gpfs/
    Checking "foodata1"
    fsckFlags 0x18
    needNewLogs 0
    nThreads 8
    clientTerm 0
    fsckReady 1
    fsckCreated 0
    % pool allowed 50
    tuner off
    threshold 0.20
    Disks 4
    Bytes per subblock 131072 131072
    Sectors per subblock 256 1654712940
    Sectors per indirect block 64
    Subblocks per block 32
    Subblocks per indirect block 1
    Inodes 7372800
    Inode size 512
    singleINum -1
    Inode regions 131
    maxInodesPerSegment 522240
    Segments per inode region 1
    Bytes per inode segment 4194304
    nInode0Files 1
    Memory available per pass 4214505436
    Regions per pass of pool system 1124
    fsckStatus 2
    lrOwned -1
    hrOwned -1
    PA size 0
    PA map size 0
    PA OptimalInodes 0
    Inodes per inode block 8192
    Data ptrs per inode 16
    Indirect ptrs per inode 16
    Data ptrs per indirect 1363
    User files exposed some
    Meta files exposed some
    User files ill replicated some
    Meta files ill replicated some
    User files unbalanced all
    Meta files unbalanced all
    Current snapshots 0
    Max snapshots 256
    checkFilesets 1
    checkFilesetsV2 1
    Worker node 0
    Checking inodes
    Regions 0 to 1123 of total 1124 in storage pool "system".
    Node x.x.27.29 (foo09) starting inode scan 0 to 65535
    [lots more output about inode scanning...]
    Lost blocks were found.
    Correct the allocation map? y
    292765696 subblocks
    62243195 allocated
    32010 unreferenced
    32010 deallocated
    2531993 addresses
    0 suspended
    File system is clean.
    Exit status 0:10:0.
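To put the lost blocks in perspective, the summary numbers can be converted to bytes. This is a back-of-the-envelope sketch in Python using the values printed above. It assumes that "Bytes per subblock" is the size of one subblock and that the "subblocks" figure in the summary counts every subblock in the filesystem. The variable names are my own.

```python
SUBBLOCK_BYTES = 131072      # "Bytes per subblock" from the mmfsck output
SUBBLOCKS_PER_BLOCK = 32     # "Subblocks per block"
DEALLOCATED = 32010          # "deallocated" subblocks in the summary
TOTAL_SUBBLOCKS = 292765696  # "subblocks" in the summary (assumed to be the total)

# Full block size implied by the subblock size.
block_bytes = SUBBLOCK_BYTES * SUBBLOCKS_PER_BLOCK
# Space returned to the allocation map by the repair.
reclaimed_bytes = DEALLOCATED * SUBBLOCK_BYTES
total_bytes = TOTAL_SUBBLOCKS * SUBBLOCK_BYTES

print(f"block size: {block_bytes // 2**20} MiB")
print(f"reclaimed:  {reclaimed_bytes / 2**30:.2f} GiB")
print(f"share of filesystem: {reclaimed_bytes / total_bytes:.4%}")
```

Under those assumptions the repair handed back a little under 4 GiB, a tiny fraction of the filesystem.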
And now we’re ready to remount the filesystem.
    [root@foonsd1 log]# mmlsdisk foodata1
    disk         driver   sector failure holds    holds                            storage
    name         type       size   group metadata data  status        availability pool
    ------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
    nsd1         nsd         512       1 Yes      Yes   ready         up           system
    nsd2         nsd         512       2 Yes      Yes   ready         up           system
    nsd3         nsd         512       1 Yes      Yes   ready         up           system
    nsd4         nsd         512       2 Yes      Yes   ready         up           system
    [root@foonsd1 log]# mmmount all -a
    Tue Sep 4 10:21:19 MST 2012: mmmount: Mounting file systems ...
    [root@foonsd1 log]# mmlsmount all
    File system foodata1 is mounted on 18 nodes.