In a previous article titled How to add disks to an existing GPFS filesystem, I discussed adding additional disks/NSDs to a preexisting GPFS 3.4.0 filesystem. It turns out that the configuration file format used by the mmcrnsd
utility has completely changed between the 3.4.0.x and 3.5.0.0 releases. The old “DescFile” format has been replaced by the new “StanzaFile” format.
Here is a small portion of the mmcrnsd
man page provided by the package gpfs.base-3.4.0-8.x86_64
that discusses the “DescFile” format expected by the command.
Parameters
-F DescFile
Specifies the file containing the list of disk descriptors,
one per line. Disk descriptors have this format:
DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
...
And here is the corresponding section of the mmcrnsd
man page provided by the package gpfs.base-3.5.0-4.x86_64
.
Parameters
-F StanzaFile
Specifies the file containing the NSD stanzas for the disks to
be created. NSD stanzas have this format:
%nsd: device=DiskName
nsd=NsdName
servers=ServerList
usage={dataOnly | metadataOnly | dataAndMetadata | descOnly}
failureGroup=FailureGroup
pool=StoragePool
...
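To make the difference concrete, here is a rough side-by-side for a single disk, using the first new disk from the example below (/dev/sdf on node foo1). The 3.4.x line is reconstructed from the DescFile syntax above, so treat it as illustrative rather than something copied from a real 3.4.x configuration.
# GPFS 3.4.x DescFile: one colon-delimited descriptor per line
sdf:foo1::dataAndMetadata:1:foo1_nsd4:system

# GPFS 3.5.x StanzaFile: one %nsd stanza per disk
%nsd: device=sdf
nsd=foo1_nsd4
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system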
Below is an example “StanzaFile” for a 3-node GPFS cluster with 3 preexisting disks per node and no shared storage; each node acts as an NSD server for its own disks. I want to add an additional disk/NSD to all 3 systems, and coincidentally the new block device happens to be named /dev/sdf
on all of them. This will increase the total number of GPFS disks in the cluster from 9 to 12.
Note that the “StanzaFile” format is not sensitive to blank lines between declarations, so I’ve used a blank line to separate the %nsd
declarations for each node.
%nsd: device=sdc
nsd=foo1_nsd1
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sdd
nsd=foo1_nsd2
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sde
nsd=foo1_nsd3
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system
%nsd: device=sdf
nsd=foo1_nsd4
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system

%nsd: device=sdc
nsd=foo2_nsd1
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sdd
nsd=foo2_nsd2
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sde
nsd=foo2_nsd3
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system
%nsd: device=sdf
nsd=foo2_nsd4
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system

%nsd: device=sdc
nsd=foo3_nsd1
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sdd
nsd=foo3_nsd2
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sde
nsd=foo3_nsd3
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
%nsd: device=sdf
nsd=foo3_nsd4
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
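If you’d rather not type the new entries by hand, a small shell loop can append the three new stanzas to the existing file. This is just a sketch using the hostnames, device name, and failure-group numbering from this example:
# Append a %nsd stanza for the new /dev/sdf on each of the three nodes.
# Assumes the node names (foo1..foo3) and failure groups (1..3) used above.
for i in 1 2 3; do
  cat >> stanzafile.txt <<EOF
%nsd: device=sdf
nsd=foo${i}_nsd4
servers=foo${i}
usage=dataAndMetadata
failureGroup=${i}
pool=system

EOF
done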
This is the present state of the cluster.
[root@foo1 ~]# mmlsnsd -X
Disk name NSD volume ID Device Devtype Node name Remarks
---------------------------------------------------------------------------------------------------
foo1_nsd1 8CFC1C05507CC5D0 /dev/sda generic foo1.example.org server node
foo1_nsd2 8CFC1C05507CC5D1 /dev/sdb generic foo1.example.org server node
foo1_nsd3 8CFC1C05507CC5D2 /dev/sdc generic foo1.example.org server node
foo2_nsd1 8CFC1C08507CC5D3 /dev/sdc generic foo2.example.org server node
foo2_nsd2 8CFC1C08507CC5D6 /dev/sdd generic foo2.example.org server node
foo2_nsd3 8CFC1C08507CC5D8 /dev/sde generic foo2.example.org server node
foo3_nsd1 8CFC1C0B507CC600 /dev/sda generic foo3.example.org server node
foo3_nsd2 8CFC1C0B507CC602 /dev/sdb generic foo3.example.org server node
foo3_nsd3 8CFC1C0B507CC605 /dev/sdc generic foo3.example.org server node
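Before actually creating the new NSDs, it doesn’t hurt to confirm that /dev/sdf is visible on every node. A quick loop over the NSD servers (node names as used throughout this example) is one way to do that; lsblk is just one choice of tool here:
# Check that the new block device is present on every NSD server
for n in foo1 foo2 foo3; do
  echo "== ${n} =="
  ssh ${n} lsblk /dev/sdf
done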
And now we pass the “StanzaFile” that includes the new and old NSD definitions to mmcrnsd
.
[root@foo1 ~]# mmcrnsd -F stanzafile.txt
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo1_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo1_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo1_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo2_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo2_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo2_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo3_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo3_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo3_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
This is what stanzafile.txt
looks like after having been processed by mmcrnsd
. Note that as with the “DescFile” format from GPFS 3.4.0, mmcrnsd
has commented out all of the preexisting NSDs.
# %nsd: device=sdc
# nsd=foo1_nsd1
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
# %nsd: device=sdd
# nsd=foo1_nsd2
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
# %nsd: device=sde
# nsd=foo1_nsd3
# servers=foo1
# usage=dataAndMetadata
# failureGroup=1
# pool=system
%nsd: device=sdf
nsd=foo1_nsd4
servers=foo1
usage=dataAndMetadata
failureGroup=1
pool=system

# %nsd: device=sdc
# nsd=foo2_nsd1
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
# %nsd: device=sdd
# nsd=foo2_nsd2
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
# %nsd: device=sde
# nsd=foo2_nsd3
# servers=foo2
# usage=dataAndMetadata
# failureGroup=2
# pool=system
%nsd: device=sdf
nsd=foo2_nsd4
servers=foo2
usage=dataAndMetadata
failureGroup=2
pool=system

# %nsd: device=sdc
# nsd=foo3_nsd1
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
# %nsd: device=sdd
# nsd=foo3_nsd2
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
# %nsd: device=sde
# nsd=foo3_nsd3
# servers=foo3
# usage=dataAndMetadata
# failureGroup=3
# pool=system
%nsd: device=sdf
nsd=foo3_nsd4
servers=foo3
usage=dataAndMetadata
failureGroup=3
pool=system
Here is a listing of defined NSDs in the cluster after mmcrnsd
has completed. Note the new fooX_nsd4 NSD (/dev/sdf)
on each node.
[root@foo1 ~]# mmlsnsd -X
Disk name NSD volume ID Device Devtype Node name Remarks
---------------------------------------------------------------------------------------------------
foo1_nsd1 8CFC1C05507CC5D0 /dev/sda generic foo1.example.org server node
foo1_nsd2 8CFC1C05507CC5D1 /dev/sdb generic foo1.example.org server node
foo1_nsd3 8CFC1C05507CC5D2 /dev/sdc generic foo1.example.org server node
foo1_nsd4 8CFC1C055122CB38 /dev/sdf generic foo1.example.org server node
foo2_nsd1 8CFC1C08507CC5D3 /dev/sdc generic foo2.example.org server node
foo2_nsd2 8CFC1C08507CC5D6 /dev/sdd generic foo2.example.org server node
foo2_nsd3 8CFC1C08507CC5D8 /dev/sde generic foo2.example.org server node
foo2_nsd4 8CFC1C085122CB3A /dev/sdf generic foo2.example.org server node
foo3_nsd1 8CFC1C0B507CC600 /dev/sda generic foo3.example.org server node
foo3_nsd2 8CFC1C0B507CC602 /dev/sdb generic foo3.example.org server node
foo3_nsd3 8CFC1C0B507CC605 /dev/sdc generic foo3.example.org server node
foo3_nsd4 8CFC1C0B5122CB3C /dev/sdf generic foo3.example.org server node
As with GPFS 3.4.0.x, the new disks are not yet part of an existing filesystem.
[root@foo1 ~]# mmlsdisk foo1
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1 nsd 512 1 Yes Yes ready up system
foo1_nsd2 nsd 512 1 Yes Yes ready up system
foo1_nsd3 nsd 512 1 Yes Yes ready up system
foo2_nsd1 nsd 512 2 Yes Yes ready up system
foo2_nsd2 nsd 512 2 Yes Yes ready up system
foo2_nsd3 nsd 512 2 Yes Yes ready up system
foo3_nsd1 nsd 512 3 Yes Yes ready up system
foo3_nsd2 nsd 512 3 Yes Yes ready up system
foo3_nsd3 nsd 512 3 Yes Yes ready up system
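Another quick way to confirm this is mmlsnsd -F, which lists only the NSDs that do not belong to any filesystem; at this point it should show just the three new fooX_nsd4 disks.
[root@foo1 ~]# mmlsnsd -F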
Now we’re going to add those 3 new disks to a preexisting filesystem with the mmadddisk
command. This works exactly the same as with GPFS 3.4.0.x, except that mmadddisk
now expects a processed “StanzaFile” instead of a “DescFile”.
[root@foo1 ~]# mmadddisk foo1 -F ./stanzafile.txt
The following disks of foo1 will be formatted on node foo2.example.org:
foo1_nsd4: size 31251759104 KB
foo2_nsd4: size 31251759104 KB
foo3_nsd4: size 31251759104 KB
Extending Allocation Map
Checking Allocation Map for storage pool system
58 % complete on Mon Feb 18 18:25:09 2013
100 % complete on Mon Feb 18 18:25:13 2013
Completed adding disks to file system foo1.
mmadddisk: Propagating the cluster configuration data to all
affected nodes. This is an asynchronous process.
Now the 3 additional disks should be visible as part of the foo1
filesystem. Note the 3 fooX_nsd4
listings at the bottom of the output from mmlsdisk
.
[root@foo1 work]# mmlsdisk foo1
disk driver sector failure holds holds storage
name type size group metadata data status availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1 nsd 512 1 Yes Yes ready up system
foo1_nsd2 nsd 512 1 Yes Yes ready up system
foo1_nsd3 nsd 512 1 Yes Yes ready up system
foo2_nsd1 nsd 512 2 Yes Yes ready up system
foo2_nsd2 nsd 512 2 Yes Yes ready up system
foo2_nsd3 nsd 512 2 Yes Yes ready up system
foo3_nsd1 nsd 512 3 Yes Yes ready up system
foo3_nsd2 nsd 512 3 Yes Yes ready up system
foo3_nsd3 nsd 512 3 Yes Yes ready up system
foo1_nsd4 nsd 512 1 Yes Yes ready up system
foo2_nsd4 nsd 512 2 Yes Yes ready up system
foo3_nsd4 nsd 512 3 Yes Yes ready up system
The foo1
filesystem now consists of 12 disks of equal size. However, 9 of the disks have 20% or less of their capacity free, while the 3 new disks are almost completely unused.
[root@foo1 ~]# mmdf foo1
disk disk size failure holds holds free KB free KB
name in KB group metadata data in full blocks in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1 31251759104 1 Yes Yes 5145806848 ( 16%) 236481600 ( 1%)
foo1_nsd2 31251759104 1 Yes Yes 5142163456 ( 16%) 251675200 ( 1%)
foo1_nsd3 31251759104 1 Yes Yes 5240137728 ( 17%) 131968576 ( 0%)
foo1_nsd4 31251759104 1 Yes Yes 31249270784 (100%) 20928 ( 0%)
foo2_nsd2 31251759104 2 Yes Yes 5144201216 ( 16%) 176784256 ( 1%)
foo2_nsd3 31251759104 2 Yes Yes 5979107328 ( 19%) 114685120 ( 0%)
foo2_nsd4 31251759104 2 Yes Yes 31249854464 (100%) 17856 ( 0%)
foo2_nsd1 31251759104 2 Yes Yes 5144209408 ( 16%) 223169664 ( 1%)
foo3_nsd3 31251759104 3 Yes Yes 6262112256 ( 20%) 112285056 ( 0%)
foo3_nsd2 31251759104 3 Yes Yes 5143769088 ( 16%) 143519552 ( 0%)
foo3_nsd1 31251759104 3 Yes Yes 5142896640 ( 16%) 245539392 ( 1%)
foo3_nsd4 31251759104 3 Yes Yes 31249786880 (100%) 21120 ( 0%)
------------- -------------------- -------------------
(pool total) 375021109248 142093316096 ( 38%) 1636168320 ( 0%)
============= ==================== ===================
(total) 375021109248 142093316096 ( 38%) 1636168320 ( 0%)
Inode Information
-----------------
Number of used inodes: 51937690
Number of free inodes: 18177638
Number of allocated inodes: 70115328
Maximum number of inodes: 146800640
Here’s the exact same output rounded to the nearest TiB for bragging rights. 🙂
[root@foo1 ~]# mmdf foo1 --block-size 1T
disk disk size failure holds holds free TB free TB
name in TB group metadata data in full blocks in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1 30 1 Yes Yes 5 ( 16%) 1 ( 1%)
foo1_nsd2 30 1 Yes Yes 5 ( 16%) 1 ( 1%)
foo1_nsd3 30 1 Yes Yes 5 ( 17%) 1 ( 0%)
foo1_nsd4 30 1 Yes Yes 30 (100%) 1 ( 0%)
foo2_nsd2 30 2 Yes Yes 5 ( 16%) 1 ( 1%)
foo2_nsd3 30 2 Yes Yes 6 ( 19%) 1 ( 0%)
foo2_nsd4 30 2 Yes Yes 30 (100%) 1 ( 0%)
foo2_nsd1 30 2 Yes Yes 5 ( 16%) 1 ( 1%)
foo3_nsd3 30 3 Yes Yes 6 ( 20%) 1 ( 0%)
foo3_nsd2 30 3 Yes Yes 5 ( 16%) 1 ( 0%)
foo3_nsd1 30 3 Yes Yes 5 ( 16%) 1 ( 1%)
foo3_nsd4 30 3 Yes Yes 30 (100%) 1 ( 0%)
------------- -------------------- -------------------
(pool total) 350 133 ( 38%) 2 ( 0%)
============= ==================== ===================
(total) 350 133 ( 38%) 2 ( 0%)
Inode Information
-----------------
Number of used inodes: 51937685
Number of free inodes: 18177643
Number of allocated inodes: 70115328
Maximum number of inodes: 146800640
It’s not strictly necessary at this point, but it may be desirable to restripe the filesystem to even out the load across LUNs and possibly improve read performance for existing large files (by virtue of them becoming striped across additional LUNs). I like to limit the nodes participating in the restriping to just those acting as NSD servers. For example:
time mmrestripefs foo1 -b -N foo1,foo2,foo3
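Once the restripe completes, re-running mmdf should show the free space spread much more evenly across all 12 NSDs rather than concentrated on the three new ones:
[root@foo1 ~]# mmdf foo1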