Originally published 2020
Recently my team needed to rebuild a number of servers to increase capacity. The original server cluster was physically out of drive space, so we were re-using x86_64 servers until the new, properly sized installation with the upgraded software was available. We couldn't wait for that new installation because it was months out, and the current server farm was going to exceed capacity within a few weeks.
Since the new servers had to match the old systems, we were limited to installing RedHat Enterprise Linux 6 (RHEL-6) instead of moving to the more modern RHEL-7. To further complicate things, not all of these stopgap systems had hard drives of sufficient capacity. But we were also given an older SAN that had plenty of capacity, and we had compatible cards to connect the servers to it.
With that mixture of equipment we set out to install.
Complications
Thankfully the systems did have a small (300GB) drive we could use to install the operating system to. Our base RHEL-6 image installed easily, and we got the SAN connections configured and things were looking good. Or so we thought.
The SAN team found that they were unable to present anything but a single 30 TiB (tebibyte) drive to each system. That met the raw capacity requirement, but presenting it in any other arrangement (such as multiple smaller LUNs) was eluding us for some reason.
The next issue we ran into is that the software required (or at least the software vendor only approved) the use of the “ext4” file system, not the newer “xfs” filesystem that Linux and RHEL-6 provide.
Not an issue… Until you realize that ext4 on RHEL-6 on x86_64 is limited to 16 TiB per filesystem. Additionally, RHEL 6 only supports drives this large when they use the GPT disk label (not MBR, which has a 2 TiB disk limit).
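As a quick sanity check before doing anything destructive (the device name /dev/sdb below is just a placeholder for however the SAN LUN shows up on your system), you can confirm the disk size and which partition table type it currently carries:

# Show the disk size and the partition table type (gpt vs msdos/MBR)
parted /dev/sdb print | grep -E 'Disk /dev|Partition Table'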
Knowing what to do.
The first step in automating a process is making sure you understand exactly what is needed – after all, you can't automate what you don't understand. Even though the steps were extremely simple, there was still room for human error. An even bigger worry was that, over time, the people executing these manual tasks would start to accidentally skip steps or take shortcuts and other time-saving measures that would make the newest systems look different from the original ones.
But more on that later; first, let's look at what we knew we had to do on each system.
Step 1 – Clean up old partitions and create two new ones that are <= 16 TiB.
sgdisk -g --clear /dev/sdXX
sgdisk --new 1:2048:34359738368 --new 2:34359740416:64403034568 /dev/sdXX
This clears out the old partition table and creates two new partitions, one roughly 16 TiB and the other roughly 14 TiB. We could have done two 15 TiB partitions instead. Or three 10 TiB ones. You get the picture.
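To double-check the result (using the same /dev/sdXX placeholder), printing the new table confirms the sector math landed where we wanted:

# Print the freshly created GPT partition table
sgdisk -p /dev/sdXX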
Step 2 – Format the filesystems with the necessary filesystem options.
mkfs.ext4 /dev/sdXX -E lazy_itable_init
tune2fs -m0 /dev/sdXX
Formatting a large partition can take quite a while, so we added the “-E lazy_itable_init” flag. This performs the basic formatting needed to make the filesystem available but defers initializing the inode tables until after the filesystem is first mounted. That speeds up the initial setup of the servers, and the deferred work is spread out over time, only adding a small cost the first time each block group is touched. The “tune2fs -m0” command then sets the reserved-block percentage to zero so that none of the capacity is held back for root-only use.
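If you want to confirm both settings took effect (again, /dev/sdXX is a placeholder), the filesystem superblock can be inspected after formatting:

# Reserved block count should read 0 after "tune2fs -m0"
tune2fs -l /dev/sdXX | grep -i 'reserved block count'
# "uninit_bg" in the feature list indicates the lazily initialized block groups
tune2fs -l /dev/sdXX | grep -i 'filesystem features'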
Step 3 – Mount them at the proper location
There were not a lot of special mount point requirements, so to keep usage easier for the developers we mirrored the initial systems and put everything under the “/mounts/” root mount point.
mkdir -p /mounts/data##
mount -t ext4 /dev/sdXX /mounts/data##
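A quick check that the new filesystems mounted where expected (data## is the same placeholder used above):

# Confirm the mount and the available capacity
df -h /mounts/data##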
Step 4 – Setup the persistent mount in /etc/fstab
Finally, to make sure these partitions come back after a reboot, we added a line to /etc/fstab for each one.
/dev/sdXX /mounts/data## ext4 noatime,nodiratime,nobarrier,nofail,rw 0 2
The specific parameters such as “noatime” and “nodiratime” were requested by the vendor to improve performance by cutting out metadata updates (file and directory access times) that didn’t provide much value. And since the data within these partitions is replicated across multiple machines in the cluster, the “filesystem dump” flag (the “0”) was set so the backup system knew to ignore these filesystems. (We actually use a much more modern tool than “dump” to back up systems, but this was added for consistency.)
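Before trusting the new entries to a reboot, it’s worth making sure they parse and mount cleanly:

# Re-read /etc/fstab and mount anything not already mounted;
# an error here usually means a typo in one of the new lines
mount -a
# Verify the mounts and their options as the system sees them
grep /mounts /etc/mtab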
Rinse and repeat…
Now that we knew what to do, it was easy to see how many possible points of failure there were if humans did these steps by hand across dozens of servers, especially since we expected this cluster to grow again if the new cluster deployment slipped further.
To address all this, we built an Ansible playbook to automate these steps. We designed it from the start to be flexible, so that if the process proved useful in the new server deployments it would take minimal effort to re-use the code.
We started by setting up an inventory file and defining the drive devices each system presented.
[data_nodes]
srv01.company.com
srv02.company.com
...
srv25.company.com

[data_nodes:vars]
data_drives_01={ "drive": "/dev/sdb", "part": "/dev/sdb1", "mount": "/mounts/data01", "fstype": "ext4"}
data_drives_02={ "drive": "/dev/sdb", "part": "/dev/sdb2", "mount": "/mounts/data02", "fstype": "ext4"}
There are better ways to define this inventory that would have been more flexible over time, but this is what we used; rewriting the inventory file and the associated pieces of the Ansible playbook wasn’t judged necessary at the time.
That inventory file defines the new data nodes, “srv01” through “srv25”, and for each of them it defines the drive, the partition, the filesystem format, and the mount point.
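Before running anything against the servers, it is worth confirming that Ansible parses the inventory and resolves the group as expected (the file name “inventory” below is just a placeholder for wherever you save it):

# List the hosts Ansible resolves for the data_nodes group
ansible -i inventory data_nodes --list-hosts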
From that inventory file we set up this playbook.
- name: Create the data partitions
  parted:
    label: gpt
    device: "{{ lookup('vars', item).drive }}"
    name: "{{ lookup('vars', item).mount }}"
    number: 1 # Probably need to make this dynamic later.
    state: present
    unit: GiB
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Check for partition
  stat:
    path: "{{ lookup('vars', item).part }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Setup Hadoop data filesystems if necessary
  filesystem:
    fstype: "{{ lookup('vars', item).fstype }}"
    dev: "{{ lookup('vars', item).part }}"
    opts: "{{ sys_data_drive_opts }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Tune Hadoop data filesystem
  command: tune2fs -m0 {{ lookup('vars', item).part }}
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"
  changed_when: false

- name: Setup mount for data filesystems
  mount:
    name: "{{ lookup('vars', item).mount }}"
    src: "{{ lookup('vars', item).part }}"
    fstype: "{{ lookup('vars', item).fstype }}"
    opts: "noatime,nodiratime,nobarrier,nofail,rw"
    state: mounted
    boot: yes
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: "Build mount point with permissions"
  file:
    path: "{{ lookup('vars', item).mount }}"
    owner: root
    group: root
    state: directory
    mode: "{{ lookup('vars', item).perms | default(omit) }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"
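Assuming those tasks are wrapped in a play targeting the data_nodes group and saved as something like “data_drives.yml” (both file names here are just examples, not what we actually called them), the rollout looks like this:

# Dry-run against a single host first, then run for real across the whole group
ansible-playbook -i inventory data_drives.yml --check --limit srv01.company.com
ansible-playbook -i inventory data_drives.yml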