(Or, How to migrate all GPFS filesystem metadata to SSDs)
IBM’s general advice [as of 2012] when setting up a new non-tiered GPFS filesystem is to lean towards spreading data & metadata across all LUNs (see Data & Metadata in the GPFS wiki), so that if/when additional disks are added, metadata performance increases along with data performance. I believe this advice is fairly solid when using only rotational media. However, now that low-cost, high-write-endurance MLC flash devices are widely available (see my post on the Intel DC S3700 series), dramatic improvements in filesystem metadata operations are easy to achieve at low cost. I’ve now converted two GPFS clusters from using all mixed data & metadata disks to using a few SSDs as metadataOnly disks and converting the rest of the LUNs to dataOnly. Before doing this, it is critical that you estimate how much metadata space your filesystem is currently using and will need in the immediate future. You will also need to monitor ongoing metadata usage of the filesystem.
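One rough way to get that estimate is to combine the inode size reported by mmlsfs with the allocated-inode counts that mmdf prints in its "Inode Information" section. This is only a sketch and only a lower bound; the grep below is just one way to slice the output, and directories, indirect blocks, and extended attributes add overhead on top of it:

# mmlsfs mss1 -i
# mmdf mss1 | grep -A 5 'Inode Information'

Multiply the allocated inode count by the inode size, and by the metadata replication factor, to get a floor for how much metadataOnly capacity you need, then leave generous headroom.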
This is the conversion procedure that I used.
First, check the configuration of the filesystem before planning any changes:
# mmdf mss1 --block-size 1T
disk                disk size  failure holds    holds              free TB             free TB
name                    in TB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd3                  30        1 Yes      Yes             7 ( 21%)            1 ( 1%)
foo1_nsd1                  30        1 Yes      Yes             6 ( 21%)            1 ( 1%)
foo1_nsd2                  30        1 Yes      Yes             6 ( 20%)            1 ( 1%)
foo1_nsd4                  30        1 Yes      Yes             7 ( 22%)            1 ( 1%)
foo2_nsd3                  30        2 Yes      Yes             6 ( 21%)            1 ( 1%)
foo2_nsd4                  30        2 Yes      Yes             7 ( 22%)            1 ( 1%)
foo2_nsd1                  30        2 Yes      Yes             6 ( 20%)            1 ( 1%)
foo2_nsd2                  30        2 Yes      Yes             6 ( 20%)            1 ( 1%)
foo3_nsd4                  30        3 Yes      Yes             7 ( 23%)            1 ( 1%)
foo3_nsd3                  30        3 Yes      Yes             7 ( 21%)            1 ( 1%)
foo3_nsd2                  30        3 Yes      Yes             6 ( 20%)            1 ( 1%)
foo3_nsd1                  30        3 Yes      Yes             6 ( 20%)            1 ( 1%)
                -------------                         -------------------- -------------------
(pool total)              350                                  73 ( 21%)            4 ( 1%)
                =============                         ==================== ===================
(total)                   350                                  73 ( 21%)            4 ( 1%)

Inode Information
-----------------
Total number of used inodes in all Inode spaces:            83658216
Total number of free inodes in all Inode spaces:             5114392
Total number of allocated inodes in all Inode spaces:       88772608
Total of Maximum number of inodes in all Inode spaces:     146903040

# mmlsnsd -X

 Disk name    NSD volume ID      Device       Devtype  Node name                Remarks
---------------------------------------------------------------------------------------------------
 foo1_nsd1    8CFC1C05507CC5D0   /dev/sda     generic  pollux1.example.com      server node
 foo1_nsd2    8CFC1C05507CC5D1   /dev/sdb     generic  pollux1.example.com      server node
 foo1_nsd3    8CFC1C05507CC5D2   /dev/sdc     generic  pollux1.example.com      server node
 foo1_nsd4    8CFC1C055122CB38   /dev/sdd     generic  pollux1.example.com      server node
 foo2_nsd1    8CFC1C08507CC5D3   /dev/sda     generic  pollux2.example.com      server node
 foo2_nsd2    8CFC1C08507CC5D6   /dev/sdb     generic  pollux2.example.com      server node
 foo2_nsd3    8CFC1C08507CC5D8   /dev/sdc     generic  pollux2.example.com      server node
 foo2_nsd4    8CFC1C085122CB3A   /dev/sdd     generic  pollux2.example.com      server node
 foo3_nsd1    8CFC1C0B507CC600   /dev/sda     generic  pollux3.example.com      server node
 foo3_nsd2    8CFC1C0B507CC602   /dev/sdb     generic  pollux3.example.com      server node
 foo3_nsd3    8CFC1C0B507CC605   /dev/sdc     generic  pollux3.example.com      server node
 foo3_nsd4    8CFC1C0B5122CB3C   /dev/sdd     generic  pollux3.example.com      server node
We’re going to add 6 new disks (SSDs) across 3 different failure groups, as we’re using metadata replication for reliability.
Note that although the LUN configuration on all 3 GPFS servers in this cluster is identical, the device names are inconsistent due to variation in boot-time device enumeration that occasionally occurs. Be *very careful* that you are creating NSDs for the proper block devices.
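Before writing the stanza file, it’s worth confirming on each server which /dev/sdX node is actually an SSD. None of this is GPFS-specific, and the exact commands depend on your distribution and HBA, but something along these lines works on most Linux systems (sdg here is one of the devices used in the stanza file below):

# lsscsi
# ls -l /dev/disk/by-id/ | grep -v part
# cat /sys/block/sdg/queue/rotational

lsscsi shows the vendor/model per SCSI device, the by-id symlinks map drive serial numbers to their current /dev/sdX names, and a rotational value of 0 means the kernel sees the device as non-rotational.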
# cat ssdstanzafile.txt
%nsd: device=sdg nsd=foo1_ssd1 servers=foo1 usage=metadataOnly failureGroup=1 pool=system
%nsd: device=sdh nsd=foo1_ssd2 servers=foo1 usage=metadataOnly failureGroup=1 pool=system
%nsd: device=sdg nsd=foo2_ssd1 servers=foo2 usage=metadataOnly failureGroup=2 pool=system
%nsd: device=sdh nsd=foo2_ssd2 servers=foo2 usage=metadataOnly failureGroup=2 pool=system
%nsd: device=sde nsd=foo3_ssd1 servers=foo3 usage=metadataOnly failureGroup=3 pool=system
%nsd: device=sdf nsd=foo3_ssd2 servers=foo3 usage=metadataOnly failureGroup=3 pool=system

# mmcrnsd -F ssdstanzafile.txt
mmcrnsd: Processing disk sdg
mmcrnsd: Processing disk sdh
mmcrnsd: Processing disk sdg
mmcrnsd: Processing disk sdh
mmcrnsd: Processing disk sde
mmcrnsd: Processing disk sdf
mmcrnsd: Propagating the cluster configuration data to all affected nodes.  This is an asynchronous process.

# mmlsnsd -X

 Disk name    NSD volume ID      Device       Devtype  Node name                Remarks
---------------------------------------------------------------------------------------------------
 foo1_nsd1    8CFC1C05507CC5D0   /dev/sda     generic  pollux1.example.com      server node
 foo1_nsd2    8CFC1C05507CC5D1   /dev/sdb     generic  pollux1.example.com      server node
 foo1_nsd3    8CFC1C05507CC5D2   /dev/sdc     generic  pollux1.example.com      server node
 foo1_nsd4    8CFC1C055122CB38   /dev/sdd     generic  pollux1.example.com      server node
 foo1_ssd1    8CFC1C0551B10CF7   /dev/sdg     generic  pollux1.example.com      server node
 foo1_ssd2    8CFC1C0551B10CFA   /dev/sdh     generic  pollux1.example.com      server node
 foo2_nsd1    8CFC1C08507CC5D3   /dev/sda     generic  pollux2.example.com      server node
 foo2_nsd2    8CFC1C08507CC5D6   /dev/sdb     generic  pollux2.example.com      server node
 foo2_nsd3    8CFC1C08507CC5D8   /dev/sdc     generic  pollux2.example.com      server node
 foo2_nsd4    8CFC1C085122CB3A   /dev/sdd     generic  pollux2.example.com      server node
 foo2_ssd1    8CFC1C0851B10CFC   /dev/sdg     generic  pollux2.example.com      server node
 foo2_ssd2    8CFC1C0851B10CFF   /dev/sdh     generic  pollux2.example.com      server node
 foo3_nsd1    8CFC1C0B507CC600   /dev/sda     generic  pollux3.example.com      server node
 foo3_nsd2    8CFC1C0B507CC602   /dev/sdb     generic  pollux3.example.com      server node
 foo3_nsd3    8CFC1C0B507CC605   /dev/sdc     generic  pollux3.example.com      server node
 foo3_nsd4    8CFC1C0B5122CB3C   /dev/sdd     generic  pollux3.example.com      server node
 foo3_ssd1    8CFC1C0B51B10D00   /dev/sde     generic  pollux3.example.com      server node
 foo3_ssd2    8CFC1C0B51B10D01   /dev/sdf     generic  pollux3.example.com      server node
We’re now ready to add the new disks/NSDs to the existing filesystem. Note that the moment we add these devices to the filesystem, metadata may immediately start being written to them.
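Since this plan relies on metadata replication across failure groups, it’s also worth double-checking the filesystem’s replication settings before adding the disks; with a metadata replication factor of 2, the metadataOnly NSDs need to span at least two failure groups (here we spread six SSDs across three). A quick check with standard mmlsfs flags:

# mmlsfs mss1 -m -M -r -R

-m/-M report the default and maximum number of metadata replicas, and -r/-R the same for data.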
# mmlsdisk mss1
disk         driver   sector failure holds    holds                                  storage
name         type       size   group metadata data  status        availability      pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd2    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd3    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd1    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd2    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd3    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd1    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd2    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd3    nsd         512       3 Yes      Yes   ready         up           system
foo1_nsd4    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd4    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd4    nsd         512       3 Yes      Yes   ready         up           system

# mmadddisk mss1 -F ssdstanzafile.txt

The following disks of mss1 will be formatted on node foo1.example.com:
    foo1_ssd1: size 194805760 KB
    foo1_ssd2: size 194805760 KB
    foo2_ssd1: size 194805760 KB
    foo2_ssd2: size 194805760 KB
    foo3_ssd1: size 194805760 KB
    foo3_ssd2: size 194805760 KB
Extending Allocation Map
Checking Allocation Map for storage pool system
Completed adding disks to file system mss1.
mmadddisk: Propagating the cluster configuration data to all affected nodes.  This is an asynchronous process.

# mmlsdisk mss1
disk         driver   sector failure holds    holds                                  storage
name         type       size   group metadata data  status        availability      pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd2    nsd         512       1 Yes      Yes   ready         up           system
foo1_nsd3    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd1    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd2    nsd         512       2 Yes      Yes   ready         up           system
foo2_nsd3    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd1    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd2    nsd         512       3 Yes      Yes   ready         up           system
foo3_nsd3    nsd         512       3 Yes      Yes   ready         up           system
foo1_nsd4    nsd         512       1 Yes      Yes   ready         up           system
foo2_nsd4    nsd         512       2 Yes      Yes   ready         up           system
foo3_nsd4    nsd         512       3 Yes      Yes   ready         up           system
foo1_ssd1    nsd         512       1 Yes      No    ready         up           system
foo1_ssd2    nsd         512       1 Yes      No    ready         up           system
foo2_ssd1    nsd         512       2 Yes      No    ready         up           system
foo2_ssd2    nsd         512       2 Yes      No    ready         up           system
foo3_ssd1    nsd         512       3 Yes      No    ready         up           system
foo3_ssd2    nsd         512       3 Yes      No    ready         up           system
The new disks/SSDs are now properly part of the existing filesystem. However, not only is the filesystem metadata still spread across the previous rotational LUNs, new/changed metadata will continue to be spread across those LUNs. We need to prevent any new metadata from being written to those disks. We do that by changing each disk from allowing dataAndMetadata to being dataOnly.
# cat stanzafile.txt
%nsd: nsd=foo1_nsd1 servers=foo1 usage=dataOnly failureGroup=1 pool=system
%nsd: nsd=foo1_nsd2 servers=foo1 usage=dataOnly failureGroup=1 pool=system
%nsd: nsd=foo1_nsd3 servers=foo1 usage=dataOnly failureGroup=1 pool=system
%nsd: nsd=foo1_nsd4 servers=foo1 usage=dataOnly failureGroup=1 pool=system
%nsd: nsd=foo2_nsd1 servers=foo2 usage=dataOnly failureGroup=2 pool=system
%nsd: nsd=foo2_nsd2 servers=foo2 usage=dataOnly failureGroup=2 pool=system
%nsd: nsd=foo2_nsd3 servers=foo2 usage=dataOnly failureGroup=2 pool=system
%nsd: nsd=foo2_nsd4 servers=foo2 usage=dataOnly failureGroup=2 pool=system
%nsd: nsd=foo3_nsd1 servers=foo3 usage=dataOnly failureGroup=3 pool=system
%nsd: nsd=foo3_nsd2 servers=foo3 usage=dataOnly failureGroup=3 pool=system
%nsd: nsd=foo3_nsd3 servers=foo3 usage=dataOnly failureGroup=3 pool=system
%nsd: nsd=foo3_nsd4 servers=foo3 usage=dataOnly failureGroup=3 pool=system

# mmchdisk mss1 change -F stanzafile.txt
Verifying file system configuration information ...
mmchdisk: Propagating the cluster configuration data to all affected nodes.  This is an asynchronous process.

# mmlsdisk mss1
disk         driver   sector failure holds    holds                                  storage
name         type       size   group metadata data  status        availability      pool
------------ -------- ------ ------- -------- ----- ------------- ------------ ------------
foo1_nsd1    nsd         512       1 No       Yes   ready         up           system
foo1_nsd2    nsd         512       1 No       Yes   ready         up           system
foo1_nsd3    nsd         512       1 No       Yes   ready         up           system
foo2_nsd1    nsd         512       2 No       Yes   ready         up           system
foo2_nsd2    nsd         512       2 No       Yes   ready         up           system
foo2_nsd3    nsd         512       2 No       Yes   ready         up           system
foo3_nsd1    nsd         512       3 No       Yes   ready         up           system
foo3_nsd2    nsd         512       3 No       Yes   ready         up           system
foo3_nsd3    nsd         512       3 No       Yes   ready         up           system
foo1_nsd4    nsd         512       1 No       Yes   ready         up           system
foo2_nsd4    nsd         512       2 No       Yes   ready         up           system
foo3_nsd4    nsd         512       3 No       Yes   ready         up           system
foo1_ssd1    nsd         512       1 Yes      No    ready         up           system
foo1_ssd2    nsd         512       1 Yes      No    ready         up           system
foo2_ssd1    nsd         512       2 Yes      No    ready         up           system
foo2_ssd2    nsd         512       2 Yes      No    ready         up           system
foo3_ssd1    nsd         512       3 Yes      No    ready         up           system
foo3_ssd2    nsd         512       3 Yes      No    ready         up           system
Attention: Due to an earlier configuration change the file system is no longer properly replicated.
# mmdf mss1 --block-size 1T
disk                disk size  failure holds    holds              free TB             free TB
name                    in TB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1                  30        1 No       Yes             6 ( 21%)            1 ( 1%)
foo1_nsd2                  30        1 No       Yes             6 ( 20%)            1 ( 1%)
foo1_nsd3                  30        1 No       Yes             7 ( 21%)            1 ( 1%)
foo1_ssd2                   1        1 Yes      No              1 (100%)            1 ( 0%)
foo1_ssd1                   1        1 Yes      No              1 (100%)            1 ( 0%)
foo1_nsd4                  30        1 No       Yes             7 ( 22%)            1 ( 1%)
foo2_ssd2                   1        2 Yes      No              1 (100%)            1 ( 0%)
foo2_ssd1                   1        2 Yes      No              1 (100%)            1 ( 0%)
foo2_nsd1                  30        2 No       Yes             6 ( 20%)            1 ( 1%)
foo2_nsd3                  30        2 No       Yes             6 ( 21%)            1 ( 1%)
foo2_nsd4                  30        2 No       Yes             7 ( 22%)            1 ( 1%)
foo2_nsd2                  30        2 No       Yes             6 ( 20%)            1 ( 1%)
foo3_nsd4                  30        3 No       Yes             7 ( 23%)            1 ( 1%)
foo3_nsd3                  30        3 No       Yes             7 ( 21%)            1 ( 1%)
foo3_nsd2                  30        3 No       Yes             6 ( 20%)            1 ( 1%)
foo3_nsd1                  30        3 No       Yes             6 ( 20%)            1 ( 1%)
foo3_ssd1                   1        3 Yes      No              1 (100%)            1 ( 0%)
foo3_ssd2                   1        3 Yes      No              1 (100%)            1 ( 0%)
                -------------                         -------------------- -------------------
(pool total)              351                                  74 ( 21%)            4 ( 1%)
                =============                         ==================== ===================
(data)                    350                                  73 ( 21%)            4 ( 1%)
(metadata)                  2                                   2 (100%)            1 ( 0%)
                =============                         ==================== ===================
(total)                   351                                  74 ( 21%)            4 ( 1%)

Inode Information
-----------------
Total number of used inodes in all Inode spaces:            83693679
Total number of free inodes in all Inode spaces:             5078929
Total number of allocated inodes in all Inode spaces:       88772608
Total of Maximum number of inodes in all Inode spaces:     146903040

# mmdf mss1 -m
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_ssd2           194805760        1 Yes      No      190849024 ( 98%)         5440 ( 0%)
foo1_ssd1           194805760        1 Yes      No      190865408 ( 98%)         5568 ( 0%)
foo2_ssd2           194805760        2 Yes      No      190914560 ( 98%)         5504 ( 0%)
foo2_ssd1           194805760        2 Yes      No      190877696 ( 98%)         5504 ( 0%)
foo3_ssd1           194805760        3 Yes      No      190867456 ( 98%)         5504 ( 0%)
foo3_ssd2           194805760        3 Yes      No      190879744 ( 98%)         5504 ( 0%)
                -------------                         -------------------- -------------------
(pool total)       1168834560                           1145253888 ( 98%)        33024 ( 0%)
                =============                         ==================== ===================
(data)                      0                                    0 (  0%)            0 ( 0%)
(metadata)         1168834560                           1145253888 ( 98%)        33024 ( 0%)
                =============                         ==================== ===================
(total)            1168834560                           1145253888 ( 98%)        33024 ( 0%)
The "Attention" message above is expected: the existing metadata replicas still sit on disks that no longer hold metadata, so GPFS reports the filesystem as not properly replicated until they are moved. We’ve now prevented any new metadata from going to the rotational LUNs and ensured that all new/updated metadata will be written to the SSDs we just added to the filesystem. All of the metadata that’s on the rotational LUNs is still accessible and will be migrated over to the metadataOnly disks when it is next updated. However, this may never happen for static, unchanging files. We can force all metadata to be migrated from disks that don’t allow metadata to disks that do with a restriping operation. This may be done with the filesystem online.
# screen -S mmrestripefs
# mmrestripefs mss1 -r -N foo1,pollux2,pollux3
Scanning file system metadata, phase 1 ...
   1 % complete on Thu Jun 6 15:46:19 2013
   2 % complete on Thu Jun 6 15:46:23 2013
   3 % complete on Thu Jun 6 15:46:28 2013
   4 % complete on Thu Jun 6 15:46:33 2013
   5 % complete on Thu Jun 6 15:46:38 2013
...
  98.22 % complete on Thu Jun 6 19:30:54 2013  (  84425985 inodes  284678907 MB)
  98.24 % complete on Thu Jun 6 19:33:14 2013  (  85424754 inodes  285549722 MB)
  98.26 % complete on Thu Jun 6 19:35:45 2013  (  86436023 inodes  286198122 MB)
  98.28 % complete on Thu Jun 6 19:38:25 2013  (  87601428 inodes  286622429 MB)
 100.00 % complete on Thu Jun 6 19:41:03 2013
Scan completed successfully.

# mmdf mss1
disk                disk size  failure holds    holds              free KB             free KB
name                    in KB    group metadata data        in full blocks        in fragments
--------------- ------------- -------- -------- ----- -------------------- -------------------
Disks in storage pool: system (Maximum disk size allowed is 233 TB)
foo1_nsd1         31251759104        1 No       Yes     5806698496 ( 19%)    421305664 ( 1%)
foo1_nsd2         31251759104        1 No       Yes     5621813248 ( 18%)    402546240 ( 1%)
foo1_nsd3         31251759104        1 No       Yes     5845639168 ( 19%)    271411776 ( 1%)
foo1_ssd2           194805760        1 Yes      No       126115840 ( 65%)       709440 ( 0%)
foo1_ssd1           194805760        1 Yes      No       126117888 ( 65%)       693888 ( 0%)
foo1_nsd4         31251759104        1 No       Yes     6210627584 ( 20%)    189672192 ( 1%)
foo2_ssd2           194805760        2 Yes      No       126160896 ( 65%)       705344 ( 0%)
foo2_ssd1           194805760        2 Yes      No       126113792 ( 65%)       703360 ( 0%)
foo2_nsd1         31251759104        2 No       Yes     5767004160 ( 18%)    408482432 ( 1%)
foo2_nsd3         31251759104        2 No       Yes     5811671040 ( 19%)    281079488 ( 1%)
foo2_nsd4         31251759104        2 No       Yes     6184667136 ( 20%)    188337216 ( 1%)
foo2_nsd2         31251759104        2 No       Yes     5609695232 ( 18%)    410381504 ( 1%)
foo3_nsd4         31251759104        3 No       Yes     6619357184 ( 21%)    185679168 ( 1%)
foo3_nsd3         31251759104        3 No       Yes     5997352960 ( 19%)    270953280 ( 1%)
foo3_nsd2         31251759104        3 No       Yes     5607870464 ( 18%)    409104384 ( 1%)
foo3_nsd1         31251759104        3 No       Yes     5713088512 ( 18%)    413708416 ( 1%)
foo3_ssd1           194805760        3 Yes      No       126144512 ( 65%)       711552 ( 0%)
foo3_ssd2           194805760        3 Yes      No       126164992 ( 65%)       713088 ( 0%)
                -------------                         -------------------- -------------------
(pool total)     376189943808                          71552303104 ( 19%)   3856898432 ( 1%)
                =============                         ==================== ===================
(data)           375021109248                          70795485184 ( 19%)   3852661760 ( 1%)
(metadata)         1168834560                            756817920 ( 65%)      4236672 ( 0%)
                =============                         ==================== ===================
(total)          376189943808                          71552303104 ( 19%)   3856898432 ( 1%)

Inode Information
-----------------
Total number of used inodes in all Inode spaces:            83929469
Total number of free inodes in all Inode spaces:             4843139
Total number of allocated inodes in all Inode spaces:       88772608
Total of Maximum number of inodes in all Inode spaces:     146903040
The metadata migration is complete at this point.
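From here on, the thing to watch is how full the metadataOnly SSDs get, which is the ongoing monitoring mentioned at the start of this post. The same mmdf invocations used above do the job and are easy to wrap in a cron job or monitoring check:

# mmdf mss1 -m
# mmdf mss1 | grep -A 5 'Inode Information'

The first reports free space on only the disks that can hold metadata (now just the SSDs); the second shows inode consumption against the configured maximum.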
Sadly, I don’t have good baseline benchmark numbers for the filesystem I used in these notes, as there are at present no 10Gbit/s-connected native GPFS clients attached to it. I do have some simple before/after dbench results for a different GPFS cluster that has 4 SSDs in two different failure groups (with metadata replication set to 2) and 10Gbit/s clients. Tests were run with dbench --directory=/test/path -t 60 24.
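For anyone wanting to reproduce the comparison, the runs were simply that command executed from a 10Gbit/s-connected GPFS client before and after the metadata move; a sketch of capturing the two runs (the /test/path scratch directory and output file names are just placeholders):

# mkdir -p /test/path
# dbench --directory=/test/path -t 60 24 | tee dbench-before-ssd.txt
  ... convert metadata to SSDs as described above ...
# dbench --directory=/test/path -t 60 24 | tee dbench-after-ssd.txt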
Before SSDs:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     485447     0.170   680.002
 Close         356281     0.046   437.856
 Rename         20590     0.502   513.069
 Unlink         98341     0.707   529.437
 Qpathinfo     440253     0.014   447.824
 Qfileinfo      76704     0.001     3.801
 Qfsinfo        80869     0.006     3.806
 Sfileinfo      39540     0.050    41.414
 Find          170246     0.046    36.392
 WriteX        239892     0.552   475.391
 ReadX         762393     0.043   554.974
 LockX           1588     0.010     0.085
 UnlockX         1588     0.004     0.037
 Flush          34067    31.650  1069.617

Throughput 254.073 MB/sec  24 clients  24 procs  max_latency=1069.620 ms
After SSDs:
 Operation      Count    AvgLat    MaxLat
 ----------------------------------------
 NTCreateX     828324     0.121    46.227
 Close         607835     0.046    40.431
 Rename         35074     0.310    38.669
 Unlink        167756     0.309    50.825
 Qpathinfo     750865     0.024     5.293
 Qfileinfo     130729     0.002     1.581
 Qfsinfo       137912     0.011     2.209
 Sfileinfo      67341     0.082   222.078
 Find          290480     0.090     5.365
 WriteX        408893     0.310    53.797
 ReadX        1299853     0.010     2.386
 LockX           2690     0.018     1.357
 UnlockX         2690     0.008     0.138
 Flush          58018    17.885   263.361

Throughput 430.239 MB/sec  24 clients  24 procs  max_latency=263.368 ms