28 January 2022

Nutanix LCM – Insufficient space on ESXi scratch disk

I ran into an issue where I could not run the Nutanix LCM Inventory action on a cluster because the scratch disk on an ESXi host was too small. If you’re reading this article, chances are you have run into the same failure with the pre-check “test_esxi_scratch_space”.

I’ve seen the issue a few times now, and in my experience the ESXi host has always simply had its scratch location pointed at the wrong disk. The first couple of times I saw this, the fix was to update the scratch location from within the advanced host settings in VCSA. The part I did not like about that approach is that the host has to be rebooted before the setting takes hold, and having to schedule a maintenance window or create downtime is never ideal. Thankfully, there is another way to repoint the scratch disk that requires no downtime at all; it takes just a few lines of CLI against the ‘problem’ ESXi host.

Start by connecting via SSH to the ESXi host that is having the issue with the scratch disk.

Run the command “ls -ll /scratch” to find which volume is currently set as the scratch disk.

root@ESXi# ls -ll /scratch
lrwxrwxrwx    1 root     root            49 May  8 23:40 /scratch -> /vmfs/volumes/5xyzxyz6-dxyzxyzb-1c73-ac1xyzxyz990

Run the “df -h” command to list all of the disks on the host and their sizes.

root@ESXi# df -h

Filesystem   Size   Used Available Use% Mounted on
NFS          1.6T   1.4T    127.4G  92% /vmfs/volumes/OS-XXX-Repoxxx
VMFS-5      52.0G   1.1G     50.9G   2% /vmfs/volumes/NTNX-local-ds-17xyzz340111-B
vfat         4.0G  27.6M      4.0G   1% /vmfs/volumes/5xyzxyz-1234xyzz-12xy-1234xyzz1234
vfat       285.8M 205.8M     80.0M  72% /vmfs/volumes/5xyzxyz6-dxyzxyzb-1c73-ac1xyzxyz990
vfat       249.7M 152.6M     97.2M  61% /vmfs/volumes/58xyzxyz-cdxyzxyz-766a-12xyzxyz1226
vfat       249.7M 145.3M    104.4M  58% /vmfs/volumes/b4xyzxyz-80xyzxyz-9bf2-e5xyzxyzf6d0

Now that we know the current scratch location and the sizes of all the volumes on the host, we can check whether /scratch is actually pointing at the 4GB volume it should be using.

In the example above we can see that /scratch points at “/vmfs/volumes/5xyzxyz6-dxyzxyzb-1c73-ac1xyzxyz990”, which is only 285MB in size. That volume is far too small, so no wonder we’re getting the error.

We want to set our scratch disk to the volume that is 4GB in size. According to the list above, that means we want to use “/vmfs/volumes/5xyzxyz-1234xyzz-12xy-1234xyzz1234”. To repoint the scratch disk, we’ll use the command “ln -sfn <volume_path> /scratch”.

root@ESXi# ln -sfn /vmfs/volumes/5xyzxyz-1234xyzz-12xy-1234xyzz1234 /scratch

If we recheck what the scratch disk is on our host, we’ll see that it is now set to the proper disk volume.

root@ESXi# ls -ll /scratch

lrwxrwxrwx    1 root     root            49 May  8 23:40 /scratch -> /vmfs/volumes/5xyzxyz-1234xyzz-12xy-1234xyzz1234

Now that the scratch disk is properly configured on the host, we can update the setting in VCSA and be done.

From the host, go to Configure, then Advanced System Settings, and click “Edit”.
Select “ScratchConfig.ConfiguredScratchLocation” and set it to the same value that you just configured manually on the host. Hit “Apply”, and you’ll see that VCSA now recognizes the newly configured scratch disk.
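
If you would rather stay in the SSH session, the same advanced option can also be set from the host’s CLI with vim-cmd. This is just a sketch using the example volume from above; swap in your own volume path and double-check the value afterwards:

root@ESXi# vim-cmd hostsvc/advopt/update ScratchConfig.ConfiguredScratchLocation string /vmfs/volumes/5xyzxyz-1234xyzz-12xy-1234xyzz1234
root@ESXi# vim-cmd hostsvc/advopt/view ScratchConfig.ConfiguredScratchLocation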

Well now we’re done, and we didn’t even need to reboot a single physical host! You can read more about this error in Nutanix’s KB article about it.

26 November 2021

Nutanix services

Nutanix relies on the following services to run…

  • Acropolis
  • Genesis
  • Zookeeper
  • Zeus
  • Medusa
  • Cassandra
  • Stargate
  • Curator
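
On a running cluster you can see these services and their state on every CVM with “cluster status” (also listed in the cheat sheet further down); each service should report as UP:

nutanix@cvm$ cluster status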

Acropolis

An Acropolis follower runs on every CVM, with one elected Acropolis leader. The follower is responsible for statistics collection and publishing and provides VNC proxy capabilities. The leader is responsible for statistics collection and publishing, task scheduling and execution, VM placement and scheduling, the network controller, and the VNC proxy.
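
Acropolis also backs the acli (Acropolis CLI) commands used throughout the cheat sheet below. For example, from any CVM:

nutanix@cvm$ acli host.list
nutanix@cvm$ acli vm.list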

Genesis

Genesis is a process that runs on each node and is responsible for service interactions (start/stop/etc.) as well as for the initial configuration. Genesis runs independently of the cluster and does not require the cluster to be configured or running; the only requirement for Genesis to run is that Zookeeper is up and running.
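
You can check Genesis (and the processes it manages) on a CVM with the genesis status command from the cheat sheet below:

nutanix@cvm$ genesis status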

Zookeeper

Zookeeper stores information about all cluster components (both hardware and software), including their IP addresses, capacities, and data replication rules, in the cluster configuration. Zookeeper has no dependencies, meaning that it can start without any other cluster components running.

Zookeeper is active on either three or five nodes, depending on the redundancy factor (number of data block copies) applied to the cluster. Zookeeper uses multiple nodes to prevent stale data from being returned to other components. An odd number provides a method for breaking ties if two nodes have different information. Of these nodes, Zookeeper elects one node as the leader. The leader receives all requests for information and confers with its follower nodes. If the leader stops responding, a new leader is elected automatically.
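
In my experience, the CVMs currently holding the Zookeeper role show up as zk1/zk2/zk3 entries (plus zk4/zk5 when five Zookeeper nodes are in use) in each CVM’s /etc/hosts, so a quick, unofficial way to see them is:

nutanix@cvm$ cat /etc/hosts | grep zk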

Zeus

Zeus is an interface to access the information stored within Zookeeper and is the Nutanix library that all other components use to access the cluster configuration.

A key element of a distributed system is a method for all nodes to store and update the cluster’s configuration. This configuration includes details about the physical components in the cluster, such as hosts and disks, and logical components, like storage containers.
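
On the CVMs I’ve worked with, zeus_config_printer dumps the cluster configuration that Zeus reads out of Zookeeper. The output is long, so pipe it through less or grep:

nutanix@cvm$ zeus_config_printer | less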

Medusa

Distributed systems that store data for other systems (for example, a hypervisor that hosts virtual machines) must have a way to keep track of where that data is. In the case of a Nutanix cluster, it is also important to track where the replicas of that data are stored.

Medusa is a Nutanix abstraction layer that sits in front of the database that holds metadata. The database is distributed in a ring topology across multiple nodes in the cluster for resiliency, using a modified form of Apache Cassandra.

Cassandra

Nutanix’s implementation of Cassandra uses a version of Apache Cassandra that has been modified for high performance and automatic, on-demand scaling. Cassandra stores all metadata about the guest VM data in a Nutanix storage container.

Cassandra runs on all nodes of the cluster. The Cassandra monitor (Level 2) periodically sends a heartbeat to the Cassandra daemon that includes information about the load, schema, and health of all the nodes in the ring; it depends on Zeus/Zookeeper for this information.
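
The state of the Cassandra metadata ring can be checked from any CVM with the nodetool command from the cheat sheet below; every CVM should be listed as Up and Normal:

nutanix@cvm$ nodetool -h 0 ring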

Stargate

A distributed system that presents storage to other systems (such as a hypervisor) needs a unified component for receiving and processing data that it receives. The Nutanix cluster has a software component called Stargate that manages this responsibility.

All read and write requests are sent across an internal vSwitch to the Stargate process running on that node. Stargate depends on Medusa to gather metadata and Zeus to gather cluster configuration data. From the perspective of the hypervisor, Stargate is the main point of contact for the Nutanix cluster.
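
Stargate also serves a diagnostics page on each CVM, commonly documented on port 2009 (worth double-checking on your AOS version). A quick, unofficial peek from the CVM itself:

nutanix@cvm$ curl -s http://0:2009 | head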

Curator

A Curator leader node periodically scans the metadata database and identifies cleanup and optimization tasks that Stargate should perform, shares the analyzed metadata with the other Curator nodes, and, based on that analysis, sends commands to Stargate. Curator depends on Zeus to learn which nodes are available and on Medusa to gather metadata.
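
Curator serves a similar status page, commonly documented on port 2010, which shows recent scan activity (again, verify the port on your AOS version):

nutanix@cvm$ curl -s http://0:2010 | head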

Source: Nutanix University’s Enterprise Cloud Administration training

1 November 2021

Nutanix Cheat Sheet

Hopefully, this helps you as much as it helps me. This is by no means a comprehensive list. It’s just a place for me to jot down the various commands I use as I get to know Nutanix more intimately.

Run all NCC Health Checks

ncc health_checks run_all

Shutdown CVM

cvm_shutdown -P now

Check status of CVM metadata ring, see if all CVMs are ‘UP’

nodetool -h 0 ring

Check Cluster Status on CVM

cluster status

Check for any CVM processes that are not in the UP state

cluster status | grep -v UP

Check CVM Metadata store status

ncli host ls | egrep "Meta|Id|Name"

Verify data resiliency

ncli cluster get-domain-fault-tolerance-status type=node

Check which CVM is the Minerva Leader

afs info.get_leader

Check which CVM is the Prism Leader

afs info.prism_leader

Check which CVM is the LCM Leader

lcm_leader

Start cluster

cluster start

Restart prism on CVM

genesis restart

Prism/CVM Status

genesis status

Check if CVM is in Maintenance Mode
Note: When a CVM is in maintenance mode, only the Scavenger, Genesis, and Zeus processes should be running (the process ID is displayed next to the process name).

genesis status | grep -v "\[\]"

Cluster/Host Hardware Info (RAM, DIMMs, CPUs, etc…) from CVM

ncc hardware_info show_hardware_info

Migrate VM to a different storage container (AOS >= 5.19)

acli vm.update_container vm-name container=target-container wait=false

Change AHV host name (AOS >= 5.20)

change_ahv_hostname --host_ip=HOST_IP --host_name=NEW-AHV-HOSTNAME

Get all CVM IPs within the cluster

svmips

Get all Host IPs within the cluster

hostips

Get all IPMI IPs within the cluster

ipmiips

Get Cluster Info

ncli cluster info

Get all Hosts Info

acli host.info
ncli host ls

Verify the state of a host
-Entered maintenance: node_state equals kEnteredMaintenanceMode and schedulable equals False.
-Exited maintenance: node_state equals kAcropolisNormal and schedulable equals True.

acli host.get host-ip

Put a CVM in maintenance mode

ncli host edit id=HOST_ID enable-maintenance-mode=true

Exit a CVM from maintenance mode

ncli host edit id=HOST_ID enable-maintenance-mode=false

Put an AHV host in maintenance mode
Note: “wait=true” allows the host to migrate VMs to other hosts before it enters maintenance mode.

acli host.enter_maintenance_mode HOST_IP wait=true

Exit an AHV host from maintenance mode

acli host.exit_maintenance_mode HOST_IP

Check if AHV host is Schedulable

acli host.list

Check AOS version on all CVMs

allssh 'cat /etc/nutanix/release_version'

Check AHV version on all nodes

hostssh 'cat /etc/nutanix-release'

List all VMs on a cluster

acli vm.list

List all VMs on a host

acli host.list_vms host

List VMs in a powered ON state

acli vm.list power_state=on

List VMs in a powered OFF state

acli vm.list power_state=off

Power off all VMs running on the cluster

for vm_name in `acli vm.list power_state=on | grep -v ^'VM name' | awk '{print $1}'`; do acli vm.force_off $vm_name; done

Power on all VMs on the cluster

for vm_name in `acli vm.list power_state=off | grep -v ^'VM name' | awk '{print $1}'`; do acli vm.on $vm_name; done
6 August 2021

Nutanix password change

If you leave the default passwords in place on your Nutanix cluster, you’ll start to see alerts in Prism that the default password is still in use, for both the CVMs and the physical hosts. The alert is very easy to clear by just updating the passwords. Here’s how…

To run just the default password health check from your CVM you can use the following command:

nutanix@cvm$ ncc health_checks system_checks default_password_check

Or you can also run the complete set of NCC health checks:

nutanix@cvm$ ncc health_checks run_all

If the health check passes, you’ll see this line in the output:

/health_checks/system_checks/default_password_check              [ PASS ]

If the health check fails you’ll see this in the output and it will tell you which host(s) alerted:

/health_checks/system_checks/default_password_check              [ INFO ]
------------------------------------------------------------------------+
Detailed information for default_password_check:
Node x.x.x.x:

Nutanix Controller VM (CVM) password change

Running this command will prompt you for the desired new password for the ‘nutanix’ user on the CVM:

nutanix@cvm$ sudo passwd nutanix

Once you change the CVM’s password it will replicate to all of the CVMs in your cluster, thus changing the password on all of your CVMs at once.

Hypervisor password change

  • AHV
    To change the local “admin” account password for all AHV hypervisors in the Nutanix cluster, you can run this command from any CVM in the cluster.
    nutanix@cvm$ echo -e "CHANGING ALL AHV HOST ADMIN PASSWORDS. Note - This script cannot be used for passwords that contain special characters ( $ \ { } ^ &)\nPlease input new password: "; read -s password1; echo "Confirm new password: "; read -s password2; if [ "$password1" == "$password2" ] && [[ ! "$password1" =~ [\{\$\^}\&] ]]; then hostssh "echo -e \"admin:${password1}\" | chpasswd"; else echo "The passwords do not match or contain invalid characters (\ $ { } ^ &)"; fi
    To change the local “nutanix” account password for all AHV hypervisors in the Nutanix cluster, you can run this command from any CVM in the cluster.
    nutanix@cvm$ echo -e "CHANGING ALL AHV HOST NUTANIX PASSWORDS. Note - This script cannot be used for passwords that contain special characters ( $ \ { } ^ &)\nPlease input new password: "; read -s password1; echo "Confirm new password: "; read -s password2; if [ "$password1" == "$password2" ] && [[ ! "$password1" =~ [\{\$\^}\&] ]]; then hostssh "echo -e \"nutanix:${password1}\" | chpasswd"; else echo "The passwords do not match or contain invalid characters (\ $ { } ^ &)"; fi

  • VMware ESXi 
    To change the local root password for all ESXi hosts in the cluster, you can run this command from any CVM in the cluster.
    nutanix@cvm$ echo -e "CHANGING ALL ESXi HOST PASSWORDS. Note - This script cannot be used for passwords that contain special characters ( $ \ { }  ^ &)\nPlease input new password: "; read -s password1; echo "Confirm new password: "; read -s password2; if [ "$password1" == "$password2" ] && [[ ! "$password1" =~ [\\\{\$\^\}\&] ]]; then hostssh "echo -e \"${password1}\" | passwd root --stdin"; else echo "The passwords do not match or contain invalid characters (\ $ { } ^ &)"; fi

  • Microsoft Hyper-V 
    To change the local administrator password for all Hyper-V hosts in the cluster, you can run this command from any CVM in the cluster.
    nutanix@cvm$ echo -e "CHANGING ALL HYPER-V HOST PASSWORDS. Note - This script cannot be used for passwords that contain special characters ( $ \ { }  ^)\nPlease input new password: "; read -s password1; echo "Confirm new password: "; read -s password2; if [ "$password1" == "$password2" ] && [[ ! "$password1" =~ [\ \"\'\\\{\$\^\}] ]]; then hostssh "net user administrator $password1"; echo "Updating Host and ManagementServer Entries..."; ncli host ls | grep -i id | grep -Eo "::[0-9]*" | cut -c 3- | while read hID; do ncli host edit id=$hID hypervisor-password=$password1;done  > /dev/null; ncli host ls | grep "Hypervisor Address" | awk '{print $4}' | while read hIP; do ncli managementserver edit name=$hIP password=$password1;done > /dev/null;  else echo "The passwords do not match or contain invalid characters (\ $ { } ^)"; fi

Further info can be found in the related Nutanix KB article.

3 August 2021

Restart Prism

You might have problems with Nutanix Prism someday and need to restart the Prism service without restarting your CVM, host, or anything else; maybe super slow page loads, an overall delay in the web GUI, or some other problem. Thankfully, you can safely restart the Prism service in a way that won’t impact your production environment.

SSH into any of your CVMs and run the line below.

curl http://0:2019/prism/leader && echo

It will reply with {"leader":"x.x.x.x:9080","is_local":true} if the CVM you are on is the Prism leader, or {"leader":"x.x.x.x:9080","is_local":false} if it is not. Either way, the response contains the IP address of the leader, so if you are not on the leader you know exactly which CVM to connect to next.
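
If you just want the leader’s IP on its own, here is a quick, unofficial one-liner that cuts it out of that JSON response; it assumes the response format shown above:

nutanix@cvm$ curl -s http://0:2019/prism/leader | awk -F'"' '{print $4}' | cut -d: -f1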

Now that you have SSHed into the Prism leader, you can run the command below to stop the service.

genesis stop prism

To re-start the Prism service, simply use this command.

cluster start

Your Prism is back up and running. Something to note is that the Prism leader may now be a different CVM; it does not have to come back up on the same CVM it was on before the restart. If you want to check which CVM is now the leader, re-run the curl command from the beginning of this post and see what it returns.

Another handy command to know for just restarting the Genesis service is:

genesis restart