In a previous article titled How to add disks to an existing GPFS filesystem, I discussed adding additional disks/NSDs to a preexisting GPFS 3.4.0 filesystem. It turns out that the configuration file format used by the mmcrnsd utility has changed completely between the 3.4.0.x and 3.5.0.0 releases. The old “DescFile” format has been replaced by the new “StanzaFile” format.
Here is a small portion of the mmcrnsd man page provided by the package gpfs.base-3.4.0-8.x86_64 that discusses the “DescFile” format expected by the command.
Parameters

   -F DescFile
        Specifies the file containing the list of disk descriptors, one per line.
        Disk descriptors have this format:

        DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
        ...
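For comparison, here is what a single old-style descriptor line for the disk added later in this post might have looked like. This is a hypothetical example I've constructed from the format above, not a line taken from the previous article:

sdf:foo1::dataAndMetadata:1:foo1_nsd4:system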
And here is the corresponding section of the mmcrnsd man page provided by the package gpfs.base-3.5.0-4.x86_64.
Parameters

   -F StanzaFile
        Specifies the file containing the NSD stanzas for the disks to be created.
        NSD stanzas have this format:

        %nsd: device=DiskName
          nsd=NsdName
          servers=ServerList
          usage={dataOnly | metadataOnly | dataAndMetadata | descOnly}
          failureGroup=FailureGroup
          pool=StoragePool
        ...
Below is an example “StanzaFile” for a three-node GPFS cluster with 3 preexisting disks per node (no shared storage). I want to add an additional disk/NSD to each of the 3 systems, and coincidentally the new block device happens to be named /dev/sdf on all of them. This will increase the total number of GPFS disks in the cluster from 9 to 12.
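Before writing the stanza file, it's worth confirming that /dev/sdf really is the new, empty device on every node. A minimal sketch using plain ssh (which GPFS already requires between the nodes) and blockdev; adapt as needed:

# Confirm /dev/sdf exists on each node and reports the expected size in bytes.
for n in foo1 foo2 foo3; do
    echo "== $n =="
    ssh $n "blockdev --getsize64 /dev/sdf"
done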
Note that the “StanzaFile” format is not sensitive to blank lines between declarations, so I’ve used blank lines to separate the %nsd declarations belonging to different nodes.
%nsd: device=sdc
nsd=foo1_nsd1
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sdd
nsd=foo1_nsd2
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sde
nsd=foo1_nsd3
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sdf
nsd=foo1_nsd4
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system

%nsd: device=sdc
nsd=foo2_nsd1
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sdd
nsd=foo2_nsd2
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sde
nsd=foo2_nsd3
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sdf
nsd=foo2_nsd4
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system

%nsd: device=sdc
nsd=foo3_nsd1
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sdd
nsd=foo3_nsd2
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sde
nsd=foo3_nsd3
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sdf
nsd=foo3_nsd4
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
This is the present state of the cluster.
[root@foo1 ~]# mmlsnsd -X

 Disk name    NSD volume ID      Device       Devtype  Node name                Remarks
---------------------------------------------------------------------------------------------------
 foo1_nsd1    8CFC1C05507CC5D0   /dev/sda     generic  foo1.example.org         server node
 foo1_nsd2    8CFC1C05507CC5D1   /dev/sdb     generic  foo1.example.org         server node
 foo1_nsd3    8CFC1C05507CC5D2   /dev/sdc     generic  foo1.example.org         server node
 foo2_nsd1    8CFC1C08507CC5D3   /dev/sdc     generic  foo2.example.org         server node
 foo2_nsd2    8CFC1C08507CC5D6   /dev/sdd     generic  foo2.example.org         server node
 foo2_nsd3    8CFC1C08507CC5D8   /dev/sde     generic  foo2.example.org         server node
 foo3_nsd1    8CFC1C0B507CC600   /dev/sda     generic  foo3.example.org         server node
 foo3_nsd2    8CFC1C0B507CC602   /dev/sdb     generic  foo3.example.org         server node
 foo3_nsd3    8CFC1C0B507CC605   /dev/sdc     generic  foo3.example.org         server node
And now we pass the “StanzaFile” that includes the new and old NSD definitions to mmcrnsd.
[root@foo1 ~]# mmcrnsd -F stanzafile.txt
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo1_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo1_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo1_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo2_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo2_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo2_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo3_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo3_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo3_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
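One caveat: if a disk still carries an old NSD or filesystem descriptor from a previous life, mmcrnsd will refuse to use it by default. If I remember correctly, that verification can be overridden with the -v option, along these lines (only do this if you are certain the old contents of the disks are disposable):

# Skip the "disk was previously formatted" verification -- destroys any old contents.
mmcrnsd -F stanzafile.txt -v no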
This is what stanzafile.txt looks like after having been processed by mmcrnsd. Note that, as with the “DescFile” format from GPFS 3.4.0, mmcrnsd has commented out all of the preexisting NSDs.
# %nsd: device=sdc
# nsd=foo1_nsd1
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
# %nsd: device=sdd
# nsd=foo1_nsd2
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
# %nsd: device=sde
# nsd=foo1_nsd3
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
%nsd: device=sdf
nsd=foo1_nsd4
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system

# %nsd: device=sdc
# nsd=foo2_nsd1
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
# %nsd: device=sdd
# nsd=foo2_nsd2
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
# %nsd: device=sde
# nsd=foo2_nsd3
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
%nsd: device=sdf
nsd=foo2_nsd4
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system

# %nsd: device=sdc
# nsd=foo3_nsd1
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
# %nsd: device=sdd
# nsd=foo3_nsd2
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
# %nsd: device=sde
# nsd=foo3_nsd3
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
%nsd: device=sdf
nsd=foo3_nsd4
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
Here is a listing of the defined NSDs in the cluster after mmcrnsd has completed. Note the new fooX_nsd4 NSDs, one per node, each backed by /dev/sdf.
[root@foo1 ~]# mmlsnsd -X

 Disk name    NSD volume ID      Device       Devtype  Node name                Remarks
---------------------------------------------------------------------------------------------------
 foo1_nsd1    8CFC1C05507CC5D0   /dev/sda     generic  foo1.example.org         server node
 foo1_nsd2    8CFC1C05507CC5D1   /dev/sdb     generic  foo1.example.org         server node
 foo1_nsd3    8CFC1C05507CC5D2   /dev/sdc     generic  foo1.example.org         server node
 foo1_nsd4    8CFC1C055122CB38   /dev/sdf     generic  foo1.example.org         server node
 foo2_nsd1    8CFC1C08507CC5D3   /dev/sdc     generic  foo2.example.org         server node
 foo2_nsd2    8CFC1C08507CC5D6   /dev/sdd     generic  foo2.example.org         server node
 foo2_nsd3    8CFC1C08507CC5D8   /dev/sde     generic  foo2.example.org         server node
 foo2_nsd4    8CFC1C085122CB3A   /dev/sdf     generic  foo2.example.org         server node
 foo3_nsd1    8CFC1C0B507CC600   /dev/sda     generic  foo3.example.org         server node
 foo3_nsd2    8CFC1C0B507CC602   /dev/sdb     generic  foo3.example.org         server node
 foo3_nsd3    8CFC1C0B507CC605   /dev/sdc     generic  foo3.example.org         server node
 foo3_nsd4    8CFC1C0B5122CB3C   /dev/sdf     generic  foo3.example.org         server node
As with GPFS 3.4.0.x, the new disks are not yet part of an existing filesystem.
[root@foo1 ~]# mmlsdisk mss1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd2    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd3    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd1    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd2    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd3    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd1    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd2    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd3    nsd         512       3 Yes      Yes   ready         up           system
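As a sanity check, mmlsnsd can also list just the NSDs that do not yet belong to any filesystem (if I recall the flag correctly). At this point it should show only the three new fooX_nsd4 disks:

# List free NSDs, i.e. those not assigned to any GPFS filesystem (output omitted).
mmlsnsd -F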
Now we’re going to add those 3 new disks to the preexisting mss1 filesystem with the mmadddisk command. This works exactly the same as with GPFS 3.4.0.x, except that mmadddisk expects a processed “StanzaFile” instead of a “DescFile”.
[root@foo1 ~]# mmadddisk mss1 -F ./stanzafile.txt

The following disks of mss1 will be formatted on node foo2.example.org:
    foo1_nsd4: size 31251759104 KB
    foo2_nsd4: size 31251759104 KB
    foo3_nsd4: size 31251759104 KB
Extending Allocation Map
Checking Allocation Map for storage pool system
  58 % complete on Mon Feb 18 18:25:09 2013
 100 % complete on Mon Feb 18 18:25:13 2013
Completed adding disks to file system mss1.
mmadddisk: Propagating the cluster configuration data to all affected nodes. This is an asynchronous process.
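A side note: mmadddisk can, if I recall correctly, also rebalance existing data onto the new disks in the same step via its -r flag, which would make the separate mmrestripefs run at the end of this post unnecessary. I prefer to keep the two operations separate so the lengthy rebalance can be scheduled on its own; the combined form would look roughly like this:

# Add the new NSDs and rebalance existing data onto them in one operation
# (rebalancing a large filesystem can take many hours).
mmadddisk mss1 -F ./stanzafile.txt -r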
Now the 3 additional disks should be visible as part of the mss1 filesystem. Note the 3 fooX_nsd4 listings at the bottom of the output from mmlsdisk.
[root@foo1 work]# mmlsdisk mss1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd2    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd3    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd1    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd2    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd3    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd1    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd2    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd3    nsd         512       3 Yes      Yes   ready         up           system
foo1_nsd4    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd4    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd4    nsd         512       3 Yes      Yes   ready         up           system
The mss1 filesystem now consists of 12 disks of equal size. However, 9 of the disks have 20% or less free capacity, while the 3 new disks are almost completely unused.
[root@foo1 ~]# mmdf mss1
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1         31251759104        1 Yes      Yes      5145806848 ( 16%)     236481600 ( 1%)
foo1_nsd2         31251759104        1 Yes      Yes      5142163456 ( 16%)     251675200 ( 1%)
foo1_nsd3         31251759104        1 Yes      Yes      5240137728 ( 17%)     131968576 ( 0%)
foo1_nsd4         31251759104        1 Yes      Yes     31249270784 (100%)         20928 ( 0%)
foo2_nsd2         31251759104        2 Yes      Yes      5144201216 ( 16%)     176784256 ( 1%)
foo2_nsd3         31251759104        2 Yes      Yes      5979107328 ( 19%)     114685120 ( 0%)
foo2_nsd4         31251759104        2 Yes      Yes     31249854464 (100%)         17856 ( 0%)
foo2_nsd1         31251759104        2 Yes      Yes      5144209408 ( 16%)     223169664 ( 1%)
foo3_nsd3         31251759104        3 Yes      Yes      6262112256 ( 20%)     112285056 ( 0%)
foo3_nsd2         31251759104        3 Yes      Yes      5143769088 ( 16%)     143519552 ( 0%)
foo3_nsd1         31251759104        3 Yes      Yes      5142896640 ( 16%)     245539392 ( 1%)
foo3_nsd4         31251759104        3 Yes      Yes     31249786880 (100%)         21120 ( 0%)
                -------------                         -------------------- -------------------
(pool total)     375021109248                           142093316096 ( 38%)    1636168320 ( 0%)
                =============                         ==================== ===================
(total)          375021109248                           142093316096 ( 38%)    1636168320 ( 0%)

Inode Information
-----------------
Number of used inodes:        51937690
Number of free inodes:        18177638
Number of allocated inodes:   70115328
Maximum number of inodes:    146800640
Here’s the exact same output rounded to the nearest TiB for bragging rights. 🙂
[root@foo1 ~]# mmdf mss1 --block-size 1T
disk                disk size  failure holds    holds              free TB             free TB
name                    in TB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1                  30        1 Yes      Yes                5 ( 16%)             1 ( 1%)
foo1_nsd2                  30        1 Yes      Yes                5 ( 16%)             1 ( 1%)
foo1_nsd3                  30        1 Yes      Yes                5 ( 17%)             1 ( 0%)
foo1_nsd4                  30        1 Yes      Yes               30 (100%)             1 ( 0%)
foo2_nsd2                  30        2 Yes      Yes                5 ( 16%)             1 ( 1%)
foo2_nsd3                  30        2 Yes      Yes                6 ( 19%)             1 ( 0%)
foo2_nsd4                  30        2 Yes      Yes               30 (100%)             1 ( 0%)
foo2_nsd1                  30        2 Yes      Yes                5 ( 16%)             1 ( 1%)
foo3_nsd3                  30        3 Yes      Yes                6 ( 20%)             1 ( 0%)
foo3_nsd2                  30        3 Yes      Yes                5 ( 16%)             1 ( 0%)
foo3_nsd1                  30        3 Yes      Yes                5 ( 16%)             1 ( 1%)
foo3_nsd4                  30        3 Yes      Yes               30 (100%)             1 ( 0%)
                -------------                         -------------------- -------------------
(pool total)              350                                    133 ( 38%)             2 ( 0%)
                =============                         ==================== ===================
(total)                   350                                    133 ( 38%)             2 ( 0%)

Inode Information
-----------------
Number of used inodes:        51937685
Number of free inodes:        18177643
Number of allocated inodes:   70115328
Maximum number of inodes:    146800640
It’s not strictly necessary at this point, but it may be desirable to restripe the filesystem to even out the load across LUNs and possibly improve read performance for existing large files (by virtue of them becoming striped across the additional LUNs). I like to limit the nodes participating in the restripe to just those acting as NSD servers, e.g.:
time mmrestripefs mss1 -b -N foo1,foo2,foo3
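Once the restripe completes, re-running mmdf should show roughly the same free-space percentage on all 12 NSDs instead of the 16%/100% split shown above:

mmdf mss1 --block-size 1T

Keep in mind that a full rebalance reads and rewrites a large portion of the existing data, so it is best scheduled for a quiet period.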