
How to add disks to an existing GPFS 3.5.0 filesystem


In a previous article titled How to add disks to an existing GPFS filesystem, I discussed adding additional disks/NSDs to a preexisting GPFS 3.4.0 filesystem. It turns out that the configuration file format used by the mmcrnsd utility has completely changed between the 3.4.0.x and 3.5.0.0 releases: the old “DescFile” format has been replaced by the new “StanzaFile” format.

Here is a small portion of the mmcrnsd man page provided by the package gpfs.base-3.4.0-8.x86_64 that discusses the “DescFile” format expected by the command.

       Parameters
          -F DescFile
                Specifies  the  file  containing the list of disk descriptors,
                one per line. Disk descriptors have this format:

                DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool
...

And here is the corresponding section of the mmcrnsd man page provided by the package gpfs.base-3.5.0-4.x86_64.

       Parameters
          -F StanzaFile
                Specifies the file containing the NSD stanzas for the disks to
                be created.  NSD stanzas have this format:

                %nsd: device=DiskName
                  nsd=NsdName
                  servers=ServerList
                  usage={dataOnly | metadataOnly | dataAndMetadata | descOnly}
                  failureGroup=FailureGroup
                  pool=StoragePool
...
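
To make the change concrete, here is roughly how a single new disk would be described in each format, using the field layouts from the man page excerpts above. The device, server, and NSD names here are only placeholders.

# GPFS 3.4.0.x “DescFile” style: one positional, colon-separated line per disk
sdf:foo1::dataAndMetadata:1:foo1_nsd4:system

# GPFS 3.5.0.x “StanzaFile” style: the same disk as a key=value stanza
%nsd: device=sdf
  nsd=foo1_nsd4
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system

If you have old “DescFile”s lying around, a quick-and-dirty awk one-liner along these lines can mechanically convert them. It skips comment lines but makes no attempt to handle empty or defaulted fields, so treat it as a sketch rather than a real converter.

# Convert DescFile lines (DiskName:ServerList::DiskUsage:FailureGroup:DesiredName:StoragePool)
# into %nsd stanzas; lines starting with '#' are skipped.
awk -F: '/^[^#]/ { printf "%%nsd: device=%s\n  nsd=%s\n  servers=%s\n  usage=%s\n  failureGroup=%s\n  pool=%s\n", $1, $6, $2, $4, $5, $7 }' descfile.txt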

Below is an example “StanzaFile” for a 3-node GPFS cluster (all nodes are NSD servers) with 3 preexisting disks per node and no shared storage. I want to add an additional disk/NSD to each of the 3 systems, and coincidentally the new block device happens to be named /dev/sdf on all of them. This will increase the total number of GPFS disks in the cluster from 9 to 12.

Note that the “StanzaFile” format is not sensitive to blank lines between declarations, so I’ve used blank lines to separate the %nsd declarations for the different nodes.

%nsd: device=sdc
  nsd=foo1_nsd1
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system
%nsd: device=sdd
  nsd=foo1_nsd2
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system
%nsd: device=sde
  nsd=foo1_nsd3
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system
%nsd: device=sdf
  nsd=foo1_nsd4
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system

%nsd: device=sdc
  nsd=foo2_nsd1
  servers=foo2
  usage=dataAndMetadata
  failureGroup=2
  pool=system
%nsd: device=sdd
  nsd=foo2_nsd2
  servers=foo2
  usage=dataAndMetadata
  failureGroup=2
  pool=system
%nsd: device=sde
  nsd=foo2_nsd3
  servers=foo2
  usage=dataAndMetadata
  failureGroup=2
  pool=system
%nsd: device=sdf
  nsd=foo2_nsd4
  servers=foo2
  usage=dataAndMetadata
  failureGroup=2
  pool=system

%nsd: device=sdc
  nsd=foo3_nsd1
  servers=foo3
  usage=dataAndMetadata
  failureGroup=3
  pool=system
%nsd: device=sdd
  nsd=foo3_nsd2
  servers=foo3
  usage=dataAndMetadata
  failureGroup=3
  pool=system
%nsd: device=sde
  nsd=foo3_nsd3
  servers=foo3
  usage=dataAndMetadata
  failureGroup=3
  pool=system
%nsd: device=sdf
  nsd=foo3_nsd4
  servers=foo3
  usage=dataAndMetadata
  failureGroup=3
  pool=system
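
The stanzas for the new disks differ only in the node name and failure group, so rather than typing them out by hand, something like the following shell loop could generate and append them. This is just a sketch that assumes the naming scheme used above.

# Append a %nsd stanza for the new /dev/sdf on each of the three nodes
# (foo1..foo3), reusing the node number as the failure group.
for i in 1 2 3; do
  cat <<EOF
%nsd: device=sdf
  nsd=foo${i}_nsd4
  servers=foo${i}
  usage=dataAndMetadata
  failureGroup=${i}
  pool=system

EOF
done >> stanzafile.txt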

This is the present state of the cluster.

[root@foo1 ~]# mmlsnsd -X

 Disk name    NSD volume ID      Device         Devtype  Node name                Remarks          
---------------------------------------------------------------------------------------------------
 foo1_nsd1 8CFC1C05507CC5D0   /dev/sda       generic  foo1.example.org     server node
 foo1_nsd2 8CFC1C05507CC5D1   /dev/sdb       generic  foo1.example.org     server node
 foo1_nsd3 8CFC1C05507CC5D2   /dev/sdc       generic  foo1.example.org     server node
 foo2_nsd1 8CFC1C08507CC5D3   /dev/sdc       generic  foo2.example.org     server node
 foo2_nsd2 8CFC1C08507CC5D6   /dev/sdd       generic  foo2.example.org     server node
 foo2_nsd3 8CFC1C08507CC5D8   /dev/sde       generic  foo2.example.org     server node
 foo3_nsd1 8CFC1C0B507CC600   /dev/sda       generic  foo3.example.org     server node
 foo3_nsd2 8CFC1C0B507CC602   /dev/sdb       generic  foo3.example.org     server node
 foo3_nsd3 8CFC1C0B507CC605   /dev/sdc       generic  foo3.example.org     server node

And now we pass the “StanzaFile” that includes the new and old NSD definitions to mmcrnsd.

[root@foo1 ~]# mmcrnsd -F stanzafile.txt
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo1_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo1_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo1_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo2_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo2_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo2_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Processing disk sdc
mmcrnsd: Disk name foo3_nsd1 is already registered for use by GPFS.
mmcrnsd: Processing disk sdd
mmcrnsd: Disk name foo3_nsd2 is already registered for use by GPFS.
mmcrnsd: Processing disk sde
mmcrnsd: Disk name foo3_nsd3 is already registered for use by GPFS.
mmcrnsd: Processing disk sdf
mmcrnsd: Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

This is what stanzafile.txt looks like after having been processed by mmcrnsd. Note that as with the “DescFile” format from GPFS 3.4.0, mmcrnsd has commented out all of the preexisting NSDs.

# %nsd: device=sdc
#   nsd=foo1_nsd1
#   servers=foo1
#   usage=dataAndMetadata
#   failureGroup=1
#   pool=system
# %nsd: device=sdd
#   nsd=foo1_nsd2
#   servers=foo1
#   usage=dataAndMetadata
#   failureGroup=1
#   pool=system
# %nsd: device=sde
#   nsd=foo1_nsd3
#   servers=foo1
#   usage=dataAndMetadata
#   failureGroup=1
#   pool=system
%nsd: device=sdf
  nsd=foo1_nsd4
  servers=foo1
  usage=dataAndMetadata
  failureGroup=1
  pool=system

# %nsd: device=sdc
#   nsd=foo2_nsd1
#   servers=foo2
#   usage=dataAndMetadata
#   failureGroup=2
#   pool=system
# %nsd: device=sdd
#   nsd=foo2_nsd2
#   servers=foo2
#   usage=dataAndMetadata
#   failureGroup=2
#   pool=system
# %nsd: device=sde
#   nsd=foo2_nsd3
#   servers=foo2
#   usage=dataAndMetadata
#   failureGroup=2
#   pool=system
%nsd: device=sdf
  nsd=foo2_nsd4
  servers=foo2
  usage=dataAndMetadata
  failureGroup=2
  pool=system

# %nsd: device=sdc
#   nsd=foo3_nsd1
#   servers=foo3
#   usage=dataAndMetadata
#   failureGroup=3
#   pool=system
# %nsd: device=sdd
#   nsd=foo3_nsd2
#   servers=foo3
#   usage=dataAndMetadata
#   failureGroup=3
#   pool=system
# %nsd: device=sde
#   nsd=foo3_nsd3
#   servers=foo3
#   usage=dataAndMetadata
#   failureGroup=3
#   pool=system
%nsd: device=sdf
  nsd=foo3_nsd4
  servers=foo3
  usage=dataAndMetadata
  failureGroup=3
  pool=system

Here is a listing of the NSDs defined in the cluster after mmcrnsd has completed. Note the new fooX_nsd4 /dev/sdf NSD on each node.

[root@foo1 ~]# mmlsnsd -X

 Disk name    NSD volume ID      Device         Devtype  Node name                Remarks
---------------------------------------------------------------------------------------------------
 foo1_nsd1 8CFC1C05507CC5D0   /dev/sda       generic  foo1.example.org     server node
 foo1_nsd2 8CFC1C05507CC5D1   /dev/sdb       generic  foo1.example.org     server node
 foo1_nsd3 8CFC1C05507CC5D2   /dev/sdc       generic  foo1.example.org     server node
 foo1_nsd4 8CFC1C055122CB38   /dev/sdf       generic  foo1.example.org     server node
 foo2_nsd1 8CFC1C08507CC5D3   /dev/sdc       generic  foo2.example.org     server node
 foo2_nsd2 8CFC1C08507CC5D6   /dev/sdd       generic  foo2.example.org     server node
 foo2_nsd3 8CFC1C08507CC5D8   /dev/sde       generic  foo2.example.org     server node
 foo2_nsd4 8CFC1C085122CB3A   /dev/sdf       generic  foo2.example.org     server node
 foo3_nsd1 8CFC1C0B507CC600   /dev/sda       generic  foo3.example.org     server node
 foo3_nsd2 8CFC1C0B507CC602   /dev/sdb       generic  foo3.example.org     server node
 foo3_nsd3 8CFC1C0B507CC605   /dev/sdc       generic  foo3.example.org     server node
 foo3_nsd4 8CFC1C0B5122CB3C   /dev/sdf       generic  foo3.example.org     server node

As with GPFS 3.4.0.x, the new disks are not yet part of the existing mss1 filesystem.

[root@foo1 ~]# mmlsdisk mss1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1 nsd         512       1 Yes      Yes   ready         up           system          
foo1_nsd2 nsd         512       1 Yes      Yes   ready         up           system          
foo1_nsd3 nsd         512       1 Yes      Yes   ready         up           system          
foo2_nsd1 nsd         512       2 Yes      Yes   ready         up           system          
foo2_nsd2 nsd         512       2 Yes      Yes   ready         up           system          
foo2_nsd3 nsd         512       2 Yes      Yes   ready         up           system          
foo3_nsd1 nsd         512       3 Yes      Yes   ready         up           system          
foo3_nsd2 nsd         512       3 Yes      Yes   ready         up           system          
foo3_nsd3 nsd         512       3 Yes      Yes   ready         up           system        
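
Another way to confirm this is mmlsnsd -F, which should list only the free NSDs, i.e. the ones that are not yet assigned to any filesystem:

mmlsnsd -F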

Now we’re going to add those 3 new disks to the preexisting mss1 filesystem with the mmadddisk command. This works exactly the same as with GPFS 3.4.0.x, except that mmadddisk expects a processed “StanzaFile” instead of a “DescFile”.

[root@foo1 ~]# mmadddisk mss1 -F ./stanzafile.txt

The following disks of mss1 will be formatted on node foo2.example.org:
    foo1_nsd4: size 31251759104 KB 
    foo2_nsd4: size 31251759104 KB 
    foo3_nsd4: size 31251759104 KB 
Extending Allocation Map
Checking Allocation Map for storage pool system
  58 % complete on Mon Feb 18 18:25:09 2013
 100 % complete on Mon Feb 18 18:25:13 2013
Completed adding disks to file system mss1.
mmadddisk: Propagating the cluster configuration data to all
  affected nodes.  This is an asynchronous process.

Now the 3 additional disks should be visible as part of the mss1 filesystem. Note the 3 fooX_nsd4 listings at the bottom of the output from mmlsdisk.

[root@foo1 work]# mmlsdisk mss1
disk         driver   sector failure holds    holds                            storage
name         type       size   group metadata data  status        availability pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1 nsd         512       1 Yes      Yes   ready         up           system   
foo1_nsd2 nsd         512       1 Yes      Yes   ready         up           system   
foo1_nsd3 nsd         512       1 Yes      Yes   ready         up           system   
foo2_nsd1 nsd         512       2 Yes      Yes   ready         up           system   
foo2_nsd2 nsd         512       2 Yes      Yes   ready         up           system   
foo2_nsd3 nsd         512       2 Yes      Yes   ready         up           system   
foo3_nsd1 nsd         512       3 Yes      Yes   ready         up           system   
foo3_nsd2 nsd         512       3 Yes      Yes   ready         up           system   
foo3_nsd3 nsd         512       3 Yes      Yes   ready         up           system   
foo1_nsd4 nsd         512       1 Yes      Yes   ready         up           system   
foo2_nsd4 nsd         512       2 Yes      Yes   ready         up           system   
foo3_nsd4 nsd         512       3 Yes      Yes   ready         up           system  

The mss1 filesystem now consists of 12 disks of equal size. However, 9 of the disks have 20% or less free capacity, while the 3 new disks are almost completely unused.


[root@foo1 ~]# mmdf mss1
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1      31251759104        1 Yes      Yes      5145806848 ( 16%)     236481600 ( 1%)
foo1_nsd2      31251759104        1 Yes      Yes      5142163456 ( 16%)     251675200 ( 1%)
foo1_nsd3      31251759104        1 Yes      Yes      5240137728 ( 17%)     131968576 ( 0%)
foo1_nsd4      31251759104        1 Yes      Yes     31249270784 (100%)         20928 ( 0%)
foo2_nsd2      31251759104        2 Yes      Yes      5144201216 ( 16%)     176784256 ( 1%)
foo2_nsd3      31251759104        2 Yes      Yes      5979107328 ( 19%)     114685120 ( 0%)
foo2_nsd4      31251759104        2 Yes      Yes     31249854464 (100%)         17856 ( 0%)
foo2_nsd1      31251759104        2 Yes      Yes      5144209408 ( 16%)     223169664 ( 1%)
foo3_nsd3      31251759104        3 Yes      Yes      6262112256 ( 20%)     112285056 ( 0%)
foo3_nsd2      31251759104        3 Yes      Yes      5143769088 ( 16%)     143519552 ( 0%)
foo3_nsd1      31251759104        3 Yes      Yes      5142896640 ( 16%)     245539392 ( 1%)
foo3_nsd4      31251759104        3 Yes      Yes     31249786880 (100%)         21120 ( 0%)
                -------------                         -------------------- -------------------
(pool total)     375021109248                          142093316096 ( 38%)    1636168320 ( 0%)

                =============                         ==================== ===================
(total)          375021109248                          142093316096 ( 38%)    1636168320 ( 0%)

Inode Information
-----------------
Number of used inodes:        51937690
Number of free inodes:        18177638
Number of allocated inodes:   70115328
Maximum number of inodes:    146800640

Here’s the exact same output rounded to the nearest TiB for bragging rights. 🙂

[root@foo1 ~]# mmdf mss1 --block-size 1T
disk                disk size  failure holds    holds              free TB             free TB
name                    in TB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1               30        1 Yes      Yes               5 ( 16%)             1 ( 1%)
foo1_nsd2               30        1 Yes      Yes               5 ( 16%)             1 ( 1%)
foo1_nsd3               30        1 Yes      Yes               5 ( 17%)             1 ( 0%)
foo1_nsd4               30        1 Yes      Yes              30 (100%)             1 ( 0%)
foo2_nsd2               30        2 Yes      Yes               5 ( 16%)             1 ( 1%)
foo2_nsd3               30        2 Yes      Yes               6 ( 19%)             1 ( 0%)
foo2_nsd4               30        2 Yes      Yes              30 (100%)             1 ( 0%)
foo2_nsd1               30        2 Yes      Yes               5 ( 16%)             1 ( 1%)
foo3_nsd3               30        3 Yes      Yes               6 ( 20%)             1 ( 0%)
foo3_nsd2               30        3 Yes      Yes               5 ( 16%)             1 ( 0%)
foo3_nsd1               30        3 Yes      Yes               5 ( 16%)             1 ( 1%)
foo3_nsd4               30        3 Yes      Yes              30 (100%)             1 ( 0%)
                -------------                         -------------------- -------------------
(pool total)              350                                   133 ( 38%)             2 ( 0%)

                =============                         ==================== ===================
(total)                   350                                   133 ( 38%)             2 ( 0%)

Inode Information
-----------------
Number of used inodes:        51937685
Number of free inodes:        18177643
Number of allocated inodes:   70115328
Maximum number of inodes:    146800640

It’s not strictly necessary at this point, but it may be desirable to restripe the filesystem to even out the load across LUNs and possibly increase the performance of reading existing large files (by virtue of them becoming striped across additional LUNs). I like to limit the nodes participating in the restripe to just those acting as NSD servers. For example:

time mmrestripefs mss1 -b -N foo1,foo2,foo3
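
Once the restripe completes, the free space should be spread much more evenly across the disks. A quick spot check (assuming the same filesystem name as above) is to look at the per-disk free percentages again:

mmdf mss1 | grep _nsd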
