Categories
Uncategorized

Building a custom RHEL-8 ISO

To give my end-users, and the operations team that supports their servers, a consistent experience, we build our systems from a custom ISO that uses a company Kickstart file. This Kickstart file defines the initial disk layout, the use of LVM to build partitions, which base packages to install, and the initial root user account and credentials. The Kickstart also injects a script that is run the first time the root user logs in locally; it prompts the user for the hostname and domain, DNS servers, and the IP address information, sets up the NIC bonding, and verifies the disk(s) to use for the OS installation. All of this helps ensure our builds are consistent and that our automation tool (an Ansible playbook) can connect and finish the initial configuration.

Each new Red Hat OS release requires minor changes to the Kickstart file, and Red Hat Enterprise Linux 8 is no exception.

Sources:

https://access.redhat.com/solutions/60959


View all Ansible Variables

Use this command:

$ ansible-inventory -i /path/to/inventory --yaml --graph --vars

To get the variables dumped out like this:

@all
  |--@group_1:
  |  |--host_a
  |  |  |--{variable_a = Value}
  |  |  |--{list_a = {u'alpha': 'alpha value', u'beta': 'beta value'}}
  |  |  |--{host_a_variable = Special for host_a}
  |  |--host_b
  |  |  |--{variable_a = Value}
  |  |  |--{host_b_variable = Special for host_b}

RHEL 6 and building large (>2TB) drives/partitions

Originally published 2020

Recently my team needed to rebuild a number of servers to increase capacity.  The original server cluster was physically at capacity (drive space), so we were re-using x86_64 servers until the proper new installation with the upgraded software was available.  We couldn’t wait for that installation: it was months out, and the current server farm was going to exceed capacity in a few weeks.

Since the new servers had to match the old systems, we were limited to installing Red Hat Enterprise Linux 6 (RHEL-6) instead of moving to the more modern RHEL-7.  To further complicate things, not all of these upgrade systems had hard drives of sufficient capacity.  BUT, we were also given an older SAN that had plenty of capacity, and we had compatible cards to connect the servers to it.

With that mixture of equipment we set out to install.

Complications

Thankfully the systems did have a small (300GB) drive we could use to install the operating system to.  Our base RHEL-6 image installed easily, and we got the SAN connections configured and things were looking good.  Or so we thought.

The SAN team found that they were unable to present anything but a single 30 TiB (tebibyte) drive to each system.  That met the storage requirements for raw capacity, but presenting it in any other layout was eluding us for some reason.

The next issue we ran into is that the software required (or at least the software vendor only approved) the use of the “ext4” file system, not the newer “xfs” filesystem that Linux and RHEL-6 provide.

Not an issue…   Until you realize that ext4 on RHEL-6 on x86_64 is limited to 16TiB per filesystem. Additionally, RHEL 6 only supports these large drives in GPT disk format mode (not MBR, which has a 2TiB disk limit).
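
Both ceilings fall out of 32-bit addressing. Here is a minimal sketch of the arithmetic, assuming 512-byte sectors for MBR and 4 KiB blocks for ext4 without 64-bit block numbers. Neither figure is stated above; they are the common defaults, so treat them as assumptions.

```python
# Where the 2 TiB (MBR) and 16 TiB (ext4) ceilings come from.
# Assumptions: 512-byte sectors, 4 KiB ext4 blocks, 32-bit addresses.
TIB = 1024 ** 4


def max_size_tib(address_bits: int, unit_bytes: int) -> float:
    """Largest addressable size, in TiB, for an address width and unit size."""
    return (2 ** address_bits) * unit_bytes / TIB


mbr_limit = max_size_tib(32, 512)    # MBR stores sector numbers in 32 bits
ext4_limit = max_size_tib(32, 4096)  # ext4 with 32-bit block numbers

print(f"MBR disk limit:        {mbr_limit:g} TiB")
print(f"ext4 filesystem limit: {ext4_limit:g} TiB")
```

A 30 TiB LUN is well past both numbers, which is why the disk needs a GPT label and more than one ext4 filesystem.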

Knowing what to do.

The first step in automating a process is ensuring you understand what specifically is needed – after all, you can’t automate what you don’t understand.  Even though the steps were extremely simple, there was still room for human error.  An even bigger worry was that, over time, people executing these manual tasks would start to accidentally skip steps or take shortcuts, and the newest systems would begin to look different from the original ones.

But more on that later; first let’s look at what we knew we had to do on each system.

Step 1 – Clean up old partitions and create two new ones that are <= 16TiB.

sgdisk -g --clear /dev/sdXX
sgdisk --new 1:2048:34359738368 --new 2:34359740416:64403034568 /dev/sdXX

This will clear out the old partition table and create two partitions, one of 16TiB, the other of roughly 14TiB.  We could have done two 15TiB partitions too. Or three 10TiB.  You get the picture.
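
To double-check those sector numbers, here is a small sketch that converts the two sgdisk ranges into sizes. It assumes 512-byte logical sectors, which the post does not state; `blockdev --getss` reports the real value for a given device.

```python
# Convert the sgdisk sector ranges from Step 1 into sizes.
# Assumption: 512-byte logical sectors (check with `blockdev --getss`).
SECTOR_BYTES = 512
TIB = 1024 ** 4


def partition_tib(first_sector: int, last_sector: int) -> float:
    """Size in TiB of a partition spanning first_sector..last_sector inclusive."""
    sectors = last_sector - first_sector + 1
    return sectors * SECTOR_BYTES / TIB


part1 = partition_tib(2048, 34359738368)
part2 = partition_tib(34359740416, 64403034568)

print(f"partition 1: {part1:.2f} TiB")
print(f"partition 2: {part2:.2f} TiB")
```

Under that assumption the first partition comes out a hair under 16 TiB because of the 2048-sector offset at the start of the disk, which keeps it inside the ext4 limit.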

Step 2 – Format the filesystems with the necessary filesystem options.

mkfs.ext4 -E lazy_itable_init /dev/sdXX
tune2fs -m0 /dev/sdXX

Formatting a large partition can take quite a while so we added the “-E lazy_itable_init” flag.  This performs the basic formatting needed to make the partition available, but it defers initializing the inode tables; the kernel finishes that work in the background after the filesystem is first mounted.  This flag speeds up the initial setup of the servers at the cost of some extra background I/O that is spread out over time after that first mount.

Step 3 – Mount them at the proper location

There were not a lot of special mount point requirements, so to keep things simple for the developers we mirrored the initial systems and used the “/mounts/” root mount point.

mkdir -p /mounts/data##
mount -t ext4 /dev/sdXX /mounts/data##

Step 4 – Set up the persistent mount in /etc/fstab

Finally to make sure these partitions return after reboot, we added lines to the /etc/fstab for each partition.

/dev/sdXX  /mounts/data##  ext4  noatime,nodiratime,nobarrier,nofail,rw  0 2

Parameters such as “noatime” and “nodiratime” were specifically requested by the vendor to improve system performance by reducing updates to file metadata that didn’t provide much value.  And since the data within these partitions is replicated across multiple machines in the cluster, the “filesystem dump” flag was set to “0” so the backup system knew to ignore this filesystem.  (We actually use a much more modern tool to back up systems instead of “dump” but this was added for consistency.)

Rinse and repeat…

Now that we knew what to do, it was easy to see that there were many possible points of failure if we had humans doing these across dozens of servers, especially when we were expecting this cluster to grow again if the new cluster deployment was further delayed.

To address all this, we built an Ansible playbook to automate all these steps.  We designed it from the start to be flexible so if the process proved useful in the new server deployments it would be minimally challenging to re-use the code.

We started by setting up an inventory file and defining the drive letters each system presented.

[data_nodes]
srv01.company.com
srv02.company.com
...
srv25.company.com

[data_nodes:vars]
data_drives_01={ "drive": "/dev/sdb", "part": "/dev/sdb1", "mount": "/mounts/data01", "fstype": "ext4"}
data_drives_02={ "drive": "/dev/sdb", "part": "/dev/sdb2", "mount": "/mounts/data02", "fstype": "ext4"}

There are better ways to define this inventory that would have been more flexible over time, but this is what we used, and rewriting the inventory file and the associated pieces of the Ansible playbook wasn’t judged necessary at this time.

That inventory file defines the new data nodes, “srv01” through “srv25”, and for each of them it defines the drives, the partitions, the filesystem format, and the mount points.

From that inventory file we set up this playbook.

# Every task loops over the inventory variables whose names contain
# "data_drives_", sorted by name; lookup('vars', item) returns that entry.
- name: Create the data partitions
  parted:
    label: gpt
    device: "{{ lookup('vars', item).drive }}"
    name: "{{ lookup('vars', item).mount }}"
    number: 1 # Probably need to make this dynamic later.
    state: present
    unit: GiB
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Check for partition
  stat:
    path: "{{ lookup('vars', item).part }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Setup Hadoop data filesystems if necessary
  filesystem:
    fstype: "{{ lookup('vars', item).fstype }}"
    dev: "{{ lookup('vars', item).part }}"
    # sys_data_drive_opts is not shown in this post; it has to be defined elsewhere.
    opts: "{{ sys_data_drive_opts }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: Tune Hadoop data filesystem
  command: tune2fs -m0 {{ lookup('vars', item).part }}
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"
  changed_when: false

- name: Setup mount for data filesystems
  mount:
    name: "{{ lookup('vars', item).mount }}"
    src: "{{ lookup('vars', item).part }}"
    fstype: "{{ lookup('vars', item).fstype }}"
    opts: "noatime,nodiratime,nobarrier,nofail,rw"
    state: mounted
    boot: yes
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"

- name: "Build mount point with permissions"
  file:
    path: "{{ lookup('vars', item).mount }}"
    owner: root
    group: root
    state: directory
    # "perms" is optional in the inventory; omit leaves the mode untouched.
    mode: "{{ lookup('vars', item).perms | default(omit) }}"
  loop: "{{ vars.keys() | list | select('match', '^.*data_drives_.*$') | list | sort }}"
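
The loop expression used by every task is the least obvious part of the playbook, so here is a rough Python sketch of what it selects. The dictionary is a hypothetical stand-in for the variables Ansible exposes for one host, and the value shown for sys_data_drive_opts is made up for illustration.

```python
import re

# Hypothetical stand-in for one host's variables (values are illustrative).
host_vars = {
    "inventory_hostname": "srv01.company.com",
    "data_drives_02": {"part": "/dev/sdb2", "mount": "/mounts/data02"},
    "data_drives_01": {"part": "/dev/sdb1", "mount": "/mounts/data01"},
    "sys_data_drive_opts": "-E lazy_itable_init",  # assumed value
}

PATTERN = re.compile(r"^.*data_drives_.*$")


def data_drive_names(variables: dict) -> list:
    """Mimic: vars.keys() | select('match', PATTERN) | sort"""
    return sorted(name for name in variables if PATTERN.match(name))


for name in data_drive_names(host_vars):
    drive = host_vars[name]  # roughly what lookup('vars', item) hands back
    print(name, drive["part"], drive["mount"])
```

Only the two data_drives_NN entries survive the filter, in name order, so each task runs once per partition. Adding a data_drives_03 line to the inventory would extend every task without touching the playbook.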


Ansible, Check_mode, and Async plays.

Have a task in your Ansible playbook that takes a long time to run, say a very large package installation or download across a slow network link? Depending on how long it takes, Ansible may think the command has failed and fail at that point in the playbook.

Async and Poll in a nutshell

The standard way to handle this is to use the Ansible async: and poll: keywords. The documentation isn’t really clear on this, so here’s how I think of their actions:

  • The async: B flag says “Run this command in the background for up to B seconds….”
  • The poll: P flag means “…and check the status every P seconds.”

Thus, a command like this:

- name: Download a big file
  shell: "wget -O /tmp/my_big_file.iso https://example.com/downloads/a_really_big_file.iso"
  async: 120
  poll: 5

(Yes, I know there are more Ansible-friendly ways to download a file from a remote URL, but go along with the example…)

So on a good day when the download speeds are high, it might download and Ansible will continue on. On days when the Internet connection is slow, this tool will kick off the wget command, and every 5 seconds it will check if the command is done. When it completes, the playbook goes on. If the wget fails (network error, disk write, etc), or the command takes longer than 120 seconds, Ansible will fail this step as expected.
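
If it helps to see that behaviour outside of Ansible, here is a rough Python analogy. It is only a sketch of the "run in the background, check every P seconds, give up after B seconds" idea, not how Ansible implements it, and the short sleeping child process is a stand-in for the slow download.

```python
import subprocess
import sys
import time


def run_async(cmd, async_seconds, poll_seconds):
    """Start cmd, check it every poll_seconds, give up after async_seconds.

    Returns the exit code; raises TimeoutError if the limit is exceeded.
    """
    proc = subprocess.Popen(cmd)
    deadline = time.monotonic() + async_seconds
    while True:
        code = proc.poll()  # non-blocking status check
        if code is not None:
            return code
        if time.monotonic() >= deadline:
            proc.kill()
            proc.wait()
            raise TimeoutError(f"{cmd!r} exceeded {async_seconds}s")
        time.sleep(poll_seconds)


# Stand-in for the slow download: a child process that sleeps briefly.
slow_job = [sys.executable, "-c", "import time; time.sleep(0.3)"]
print(run_async(slow_job, async_seconds=5, poll_seconds=0.1))
```

A non-zero exit code corresponds to the "wget fails" case, and the TimeoutError corresponds to the download outliving the async: limit.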

That’s all well and good. What’s the catch?

Check mode

One feature I love about Ansible is the --check mode option. A well written Ansible module will run in --check mode and do everything it can to validate that it will execute on the managed systems without making any changes to the remote system. This is key when you’re working on a playbook to maintain production systems.

Say you know that a configuration file needs a correction applied. You take the playbook you used to build the system originally, check it out of your source control to a new branch and modify the playbook.

But a cautious developer will check that the playbook runs as expected and doesn’t do anything else unexpected (reboot the server, stop services, fail mid-way through, etc.). To do this, run your playbook with the --check flag. The output looks identical to a normal run, but this time the lines marked changed: are not actual changes; they tell you that the task would make a change.

Some tasks are inherently unsafe for Ansible to run generically in this mode: shell:, command:, and other more “raw” commands have this limitation, because Ansible cannot predict what an arbitrary command will do. Ansible tries to make sure that a task run in check mode makes no changes whatsoever, so these tasks are normally skipped.

The check mode execution is handy when combined with the --diff command line flag, but that’s a story for another day.

Async and Check mode

So, using these together makes sense. I want to download a large file over an occasionally slow link but I do not want the download to run when I’m in check mode. You’d think something like the example code from above would be the correct combination:

- name: Download a big file
  shell: "wget -O /tmp/my_big_file.iso https://example.com/downloads/a_really_big_file.iso"
  async: 120
  poll: 5

But when you run it with the --check flag, you get this error:

TASK [Download a big file] ***************************
task path: ./playbook.yml:71
fatal: [localhost]: FAILED! => {
    "changed": false,
    "msg": "check mode and async cannot be used on same task."
}

What to do?

I have to admit, I didn’t think up this workaround – a Mr. Alex Dzyoba documented it on his blog and I came across it here:

https://alex.dzyoba.com/blog/ansible-check-async/

What he documents is using the ansible_check_mode variable to set the async: value to 0 if we’re in check mode, or 120 if we are not. Using our play above we would do this:

- name: Download a big file
  shell: "wget -O /tmp/my_big_file.iso https://example.com/downloads/a_really_big_file.iso"
  async: "{{ ansible_check_mode | ternary(0,120) }}"
  poll: 5

What ends up happening is based on the ansible_check_mode variable:

  • If we are running in check mode (i.e. ansible_check_mode is true), then the value passed to async: is zero (the first value in the ternary() call), and Ansible doesn’t complain about the conflict.
  • When we are running in normal mode (i.e. ansible_check_mode is false), then the value passed to async: is the second value in the ternary() call, and the task may run for up to 120 seconds.
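
For anyone who hasn’t used the ternary() filter before, this tiny Python sketch shows the same selection. The function is a simplified stand-in for the Ansible filter, not its actual code.

```python
def ternary(value, true_val, false_val):
    """Simplified stand-in for Ansible's ternary filter."""
    return true_val if value else false_val


def async_value(check_mode: bool) -> int:
    # Mirrors: async: "{{ ansible_check_mode | ternary(0, 120) }}"
    return ternary(check_mode, 0, 120)


for mode in (True, False):
    print(f"ansible_check_mode={mode} -> async: {async_value(mode)}")
```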

Why Ansible doesn’t automatically handle this is beyond me, but I’m glad to have come across Mr. Alex Dzyoba’s website and this method.


File date details

Recently I rebuilt my primary workstation and was restoring files from backup. I have a few copies built up over time and was trying to determine which specific files I wanted to keep. The output of ls was cumbersome, sometimes listing the year and other times listing just the month and date.

The convention of the ls command is to show the year instead of the time when the timestamp of the file is more than six months away from the current date, but scanning this list was annoying as it swapped from an older file (Sep 29 2017) to newer files (Nov 23 20:53).

Of course there was a flag for ls to handle this…

This answer on https://unix.stackexchange.com was spot-on for what I was looking for:

In short the ls -l --time-style=long-iso flags keep the format consistent: 2021-08-22 12:00, 2020-12-16 05:04, or 2022-10-21 04:12
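
To make the difference concrete, here is a rough Python sketch of the two layouts. It approximates the default behaviour (six months is taken as 182 days, and the sample timestamps and "now" are made up), so treat it as an illustration rather than a re-implementation of ls.

```python
from datetime import datetime, timedelta

MONTHS = ["Jan", "Feb", "Mar", "Apr", "May", "Jun",
          "Jul", "Aug", "Sep", "Oct", "Nov", "Dec"]
SIX_MONTHS = timedelta(days=182)  # approximation of the "six months" cutoff


def default_style(mtime: datetime, now: datetime) -> str:
    """Approximate the default `ls -l` date column."""
    stamp = f"{MONTHS[mtime.month - 1]} {mtime.day:2d}"
    if now - SIX_MONTHS <= mtime <= now:
        return f"{stamp} {mtime:%H:%M}"  # recent: show the time
    return f"{stamp}  {mtime.year}"      # old or future-dated: show the year


def long_iso(mtime: datetime) -> str:
    """The fixed layout produced by --time-style=long-iso."""
    return f"{mtime:%Y-%m-%d %H:%M}"


now = datetime(2022, 11, 25, 9, 0)  # assumed "current" date for the demo
for mtime in (datetime(2017, 9, 29, 8, 15), datetime(2022, 11, 23, 20, 53)):
    print(f"{default_style(mtime, now):<14}{long_iso(mtime)}")
```

The long-iso column never changes shape, which is what makes a long listing easy to scan and sort by eye.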

Viewing my backup archive directory now shows things consistently:

$ ls -al --time-style=long-iso
total 24
drwxr-xr-x 2 999 999 0 2022-11-13 14:30 .
drwxr-xr-x 2 999 999 0 2022-11-16 17:21 ..
-rwxr-xr-x 1 999 999 3478 2021-10-16 01:30 BackupToNAS.2021-10-16_0130.log
-rwxr-xr-x 1 999 999 4388 2021-10-16 01:30 BackupToNAS.2021-10-16_0130.log-scriptlog
-rwxr-xr-x 1 999 999 2986 2021-11-13 10:17 BackupToNAS.2021-11-13_1010.log
-rwxr-xr-x 1 999 999 4753 2021-11-13 10:17 BackupToNAS.2021-11-13_1010.log-scriptlog
drwxr-xr-x 2 999 999 0 2022-11-14 06:04 dan
drwxr-xr-x 2 999 999 0 2022-11-14 06:03 dan.old


My case for impeachment

Please vote to find Donald Trump guilty in his impeachment trial.

Through his inflammatory rhetoric and deeply misogynistic words and actions over the four years he was in office, and by ending his presidency condoning the attack on the Capitol through his INACTION, he by definition did NOT adhere to his oath to “preserve, protect and defend the Constitution of the United States.”

And if the defense is that “he tried” with calming words, then that was “to the best” of his ability, which is also a dereliction of his duties as President, and he must be found guilty.

As a man who brought two girls into this world, a loving husband, and brother of two strong and amazing women, I am surprised that I have lived as an American for 50 years and thought we were better than this. His public comments about women have left me shaking my head as others have listened on and ignored them for the direction he has been taking our country.

Failing to impeach a person like Donald Trump would set the precedent that they can neglect their oath of office and fail to protect ALL AMERICANS regardless of political leaning, social standing, ethnic background, gender, or the color of their skin. During his time in office, he is on public record in many mediums (Twitter, MAGA rallies, TV interviews, etc.) blatantly dividing our country and failing to live up to the high moral ethics we hold for our President.

Please find Mr. Trump guilty in the upcoming impeachment trial. If you do not, please explain to me and other voters why you chose to ignore all these public findings against him.


Doctors are human and make mistakes

This article on WOWT Channel 6 about a letter sent by a doctor’s office caught my attention, but not in a good way.

https://www.wowt.com/2020/07/25/omaha-doctors-office-issues-controversial-letter-about-children-and-covid-19/

After reading this letter, I would probably look to another family medicine practice to work with.

I have written a lot of documents for work and for personal use, and persuasive documents are some of the most critical to get right if you want to get your message across, convince people to take up your cause, or even simply help them understand your point of view. In my opinion, this letter does none of that and is probably going to cause problems for them in the long run. (And in case it’s not obvious, I’m not a medical practitioner so please talk with a medical professional you trust if you have questions.)

Right off the bat, they mentioned “SARS-CaronaVirus-2” and “COVID-19”. Both refer to the same pandemic: the “SARS” name is the formal name of the virus and COVID-19 is the disease it causes, but they use both within the letter. They don’t mention this (the letter is aimed at a non-medical audience), and it’s not evident why they felt the need to alternate the names. In this document, I’ll stick with the common COVID-19 name.

In the first section, they discuss treating patients with Hydroxychloroquine, ZPak, and other medications. The Hydroxychloroquine treatment made the news earlier this summer as the “super cure” by some people. There were reports of its effectiveness in some trials, but none of those trials could be reproduced and many more trials showed no significant benefit to treat COVID-19, and its known side effects are bad enough to make taking it risky when it is used properly (https://www.webmd.com/lung/news/20200407/side-effects-halt-use-of-chloroquine-vs-covid-19). And their use of ZPak is also concerning – ZPak is commonly used to treat bacterial infections, not viral infections such as COVID-19. Again the side effects of using ZPak in this manner are concerning because their over (mis-)use will ultimately breed antibiotic-resistant bacteria. Their off-label use of ZPak and Hydroxychloroquine seems to be pandering more to the “Karens” of the world instead of relying on sound medical practice.

The next section down-plays the role of masks in reducing the spread of the disease. A quick search of the Internet using your preferred search engine for “evidence masks work” will yield a lot of links to many well respected medical research sites discussing their benefits. While I do agree with them when they suggest that an ill person should seek treatment and stay home until they are healthy instead of relying on a mask, they are missing the obvious point. With COVID-19, many people can be symptom-free for many days – during that time they are able to infect anyone around them through the water droplets in their breath hanging in the air and landing on another person’s eyes or getting into their lungs. And as they point out in their next paragraph, it does seem that younger people tend to not get as sick as older adults. So the wearing of masks is important here too, as the masks on the young will decrease the chance of spreading, and the masks on the older will further reduce their chances of inhaling an errant cough particle. Until an effective vaccination or other treatment is available for COVID-19, wearing a mask is one of the few actions we can take to protect ourselves.

Finally, we get to what appears to be their main point: children in schools.

They begin by stating several “facts” about the rarity of certain events: how often young people contract COVID-19, how often they get sick, and finally how often the virus is transmitted to adults around them. There are many well-documented cases of people who spread diseases but never show the symptoms – does Typhoid Mary ring a bell? If you use an Internet search engine for “covid-19 transmission vectors” you’ll find numerous medical research articles where they found the exact opposite – the ability to spread COVID-19 is not clearly related to age.

In that section, they have a number of sentences that bring up “facts” about fatalities attributed to other sicknesses such as Influenza. They specifically mention pediatric fatalities attributed to COVID-19 are “somewhere between 3 and 30, in the USA”. A quick search of “pediatric coronavirus deaths in US” brings up this information from the CDC which seems to corroborate their information:

As of April 2, 2020, the coronavirus disease 2019 (COVID-19) pandemic has resulted in >890,000 cases and >45,000 deaths worldwide, including 239,279 cases and 5,443 deaths in the United States (1,2). […] Three deaths were reported among the pediatric cases included in this analysis.

https://www.cdc.gov/mmwr/volumes/69/wr/mm6914e4.htm

That is good, but the following sentence raises an alarm:

These data support previous findings that children with COVID-19 might not have reported fever or cough as often as do adults (4). Whereas most COVID-19 cases in children are not severe, serious COVID-19 illness resulting in hospitalization still occurs in this age group. Social distancing and everyday preventive behaviors remain important for all age groups as patients with less serious illness and those without symptoms likely play an important role in disease transmission (6,7).

https://www.cdc.gov/mmwr/volumes/69/wr/mm6914e4.htm

The same source for their fact on the “low risk” that COVID-19 poses to our children goes on to explain that this is probably due in large part to the infection being overlooked in children (who can then infect others), combined with the current social distancing and other preventative measures we have had in place. These actions ended the 2019/2020 school year early; as a parent, I’m worried that this fall we will have a dramatic increase in infections of our children that will cause the pediatric fatality number to go well beyond “3 and 30”.

You may have noticed that I put the word “facts” in quotation marks above. I’m not doing this for dramatic effect, rather I’m trying to point out that many of their figures and comments are stated as “facts” but there are no links to where that data came from. For most of my facts and comments I’ve noted here, I’ve tried to put links to multiple sources where possible. Their document does not provide any of this – you’re expected to take all of this at face value and not question anything.

And that’s what probably has me the most concerned. Our society has been built on learning from each other and having active discussions around topics so a wider audience can be informed and hopefully at the end of the day all sides come away with new and better information. Too many of us are taking the easy way: some fail to engage to improve their understanding of the topics, while others resort to grade-school level name-calling and shouting down instead of discussions.

Taken as a whole, the letter provided by “Family Medicine at Legacy” feels like it was written only to appease an individual of a certain mindset who wants to ignore reality and hope this all “goes away” overnight without needing to be further inconvenienced. It’s this mindset that makes me think that our society has reached a tipping point and we’re collectively the “fragile snowflake” more than the strong and resilient humans we claimed we were a few decades before.


VMs built with Packer

I was revamping my home lab VM build process using Packer when I ran into an error where my VMs were being killed off soon after they booted from the ISO. Sadly, the error messages went by so quickly I could only see this:

reboot: System halted

Not helpful at all. 🙁

I installed OBS to record the screen so I could rewind the output. That helped, and I could finally find earlier error messages with this:

dracut-cmdline[324]: //lib/dracut/hooks/cmdline/25-parse-anaconda-options.sh: line 21: echo: write error: No space left on device

Not a lot more helpful – basically that was the only error message I could get out of the failed system.

A quick bit of Googling showed that the ‘dracut-cmdline’ tool expands the boot RAM disk into RAM, and the 1GB of RAM that the VM was using was insufficient.

I increased the RAM setting in the packer JSON file to 8GB, and the system booted just fine. I’m sure 4GB or even 2GB might be sufficient, but I’ll play with this option at a later date.

JSON file entry:

"memory": "8192",

Buildup

Thanks to “https://www.reddit.com/user/MaricxX/” for this photo – it demonstrates how small glitches over time can add up if they aren’t addressed rapidly – or better yet, not allowed to start in the first place.

Cross section of layers of paint showing deformation due to imperfections magnified with each layer.
Layers of paint – credit to MaricxX from Reddit – https://www.reddit.com/user/MaricxX/

At a previous job it was common to take our Windows virtual machine templates and power them on once a month to patch the OS and apply the latest security configurations. We had been doing this with our Red Hat Linux images, but a couple years ago I converted our process so each month we built those VM templates fresh from an ISO and a Hashicorp Packer script using VMware Workstation.

This monthly fresh build ensured that we always knew how to build the VM templates in the event of a disaster, and it ensured that our build process contained exactly what we planned and advertised (through our team Git repository). As new requirements were received from the InfoSec team or other sources with system concerns that could only be readily addressed during the initial build phase, we would add those steps to the Packer config file, then test and build new.

With the prevalence of new worms and other highly effective infection vectors, my fear was that we would get a piece of malware onto the templates and then that malware would be automatically replicated each time a new system was built. And there were many times when we started the patching process each month only to find that a couple of the Windows templates had been left running since the previous month’s patch effort. There is no telling what might have crawled onto these unmanaged systems in the intervening time, only waiting for us to start using them.

While the paint analogy doesn’t perfectly match the IT world, there are sufficient correlations that it makes the possibility of replicating and amplifying a small defect all the more understandable. Still, I prefer to have my freshly-built template with its minimal layers of paint, confident that it only contains the bits we wanted.

Categories
Uncategorized Weekly Update

$RANDOM

Friday was my last day – and the weather was poor enough (snow with freezing rain) that the company sent an email the day before telling people to work from home if they could. I am glad I worked from home – I think I was able to get a lot of documentation wrapped up and some last-minute things completed and handed off. Even if I had another two weeks, I still wouldn’t have handed things off properly. There would always be one more thing to work on, one more thing to clean up, one more thing to polish. And the kicker was that I wasn’t truly handing things off as much as throwing documentation and notes into README.md files and Wiki pages and hoping someone at a future date would find them and keep the ball moving forward. But, all things come to an end – I’m looking forward to my new job starting this week and I wonder what sort of things I’ll get into next. 😀

Earlier this week Jilli sent Kris a text telling her that a Mountain Lion was roaming campus. Students were to call 911 immediately if they saw it. I was concerned that her first reaction would be to call “Here kitty, kitty!” and try to pet it. My next vision was Jilli and her friends running away from the lion, each with their phones in their hands Googling “How to escape a Mountain Lion”…

Liz had a normal week at school. She and Kris spent a lot of time together since I had a lot of late nights wrapping up work and helping with my parents. She’s continuing to use her weight training bag in the basement, plus she’s starting to cook more and more. Ready-to-bake croissant rolls are being made frequently, as are chocolate chip cookies. I’ve eaten way too many of both this week – my post-Christmas weight loss isn’t working.

The cold/crud that I brought home over Christmas has left me, but is continuing to annoy Kris. She was just starting to get over the worst of the coughing when she hurt a muscle in her back from coughing so much. She was in a lot of pain after school on Friday – she says even sleeping is painful, since lying on her back puts pressure on the muscle. I really need to talk to her mom about her body’s warranty coverage…

Mom and Dad both continue to kick around AV. It was so cold and icy Saturday morning that we decided not to go out to their house, so I rescheduled the home inspection for another week. We met with a new financial advisor this week, but I keep hoping we stumble across some gold bars or a handful of un-sold “Berkshire A” stock certificates. Probably not likely, but I can keep my fingers crossed.