Monitor VMware ESXi vsish data via shell script and Telegraf
This is a quick example of how to monitor VMware ESXi vsish data. There are several ways to collect the data on the host and to fetch it externally. In this post a shell script, run by the ESXi cron at a fixed interval, collects the data and writes it to a CSV file. Telegraf then fetches that file with its HTTP input plugin.
The data we fetch are the Rx ring buffer exhaustion, Rx drop and burst queue delivery counters of a specific VM's VMXNET3 interfaces. This solved a real-world case: we were getting occasional bursts of traffic, but we could not establish a timeline or pinpoint where the traffic was being dropped.
Even though this example is for the network side of things, it is really easy to modify the script to fetch any data you want.
Disclaimer: Running custom scripts on ESXi can be dangerous. By default you should disallow the execution of custom executables on ESXi, which helps mitigate problems with cryptolockers and other malware.
Identifying the data to get using vsish
As mentioned above, we are looking at the VMXNET3 network Rx buffers of a specific VM.
First identify the VMXNET3 interfaces and their port IDs, then check the vsish commands whose output the script is going to parse.
[root@lab-esxi01:~] net-stats -l
PortNum Type SubType SwitchName MACAddress ClientName
33554434 4 0 vSwitch0 e4:43:4b:7d:8d:05 vmnic0
33554436 3 0 vSwitch0 e4:43:4b:7d:8d:05 vmk0
...
50331758 9 0 DvsPortset-0 00:50:56:ac:23:90 test-vm.eth3
50331760 9 0 DvsPortset-0 00:50:56:ac:23:a6 test-vm.eth4
50331761 9 0 DvsPortset-0 00:50:56:ac:43:a1 test-vm.eth5
50331762 9 0 DvsPortset-0 00:50:56:ac:52:2c test-vm.eth6
50331763 9 0 DvsPortset-0 00:50:56:ac:63:9f test-vm.eth7
[root@lab-esxi01:~] vsish -e cat /net/portsets/DvsPortset-0/ports/50331758/clientStats
port client stats {
...
}
[root@lab-esxi01:~] vsish -e cat /net/portsets/DvsPortset-0/ports/50331758/vmxnet3/rxSummary
stats of a vmxnet3 vNIC rx queue {
...
}
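The mapping from the net-stats table to the vsish nodes can be sketched as below. This is a minimal sketch and not part of the original script: the function name vsish_paths is hypothetical, and it assumes the column layout shown above (port number in column 1, switch name in column 4, client name in column 6).

```shell
#!/bin/sh
# Hypothetical helper: read `net-stats -l` output on stdin and print the
# vsish rxSummary path for every interface of the given VM.
# Assumes the column layout shown above: $1 = PortNum, $4 = SwitchName,
# $6 = ClientName (for example "test-vm.eth3").
vsish_paths() {
    awk -v vm="$1" '
        # Keep only rows whose client name starts with "<vm>."
        index($6, vm ".") == 1 {
            print "/net/portsets/" $4 "/ports/" $1 "/vmxnet3/rxSummary"
        }
    '
}

# On an ESXi host this would be used as:
#   net-stats -l | vsish_paths test-vm
```

Each printed path can then be passed to vsish -e cat to inspect the counters of that interface.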
Get the vsish data via script and output to a CSV-file
The script parses the output of "net-stats -l", picks out the specifics (port number, switch name, VM name and MAC address) and uses them to query the counters with vsish commands. See the comments in the code for details.
Note: You can easily modify the script to collect the counters of all VMs. For example, instead of matching the VM name, match DvsPortset-0 (if you are using a dvSwitch) and exclude the vmk and vmnic entries.
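The all-VMs variation from the note could look roughly like this. It is a sketch under assumptions: the function name all_vm_ports is hypothetical, it would take the place of the grep "test-vm" filter in the script below, and it only applies the two criteria named in the note, so other non-VM clients on the switch may need excluding as well.

```shell
#!/bin/sh
# Hypothetical filter: read `net-stats -l` output on stdin and print
# "port,switch,client,mac" for every port on the given switch whose
# client name does not start with vmk or vmnic.
# The switch name defaults to DvsPortset-0, as in the note above.
all_vm_ports() {
    awk -v sw="${1:-DvsPortset-0}" '
        $4 == sw && $6 !~ /^(vmk|vmnic)/ {
            print $1 "," $4 "," $6 "," $5
        }
    '
}

# On an ESXi host this would be used as:
#   for row in $(net-stats -l | all_vm_ports DvsPortset-0); do ... done
```

The output has the same column order as the awk expression in the script, so the rest of the loop can stay unchanged.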
#!/bin/sh
buffer_monitor(){
fullpath=$1
# Loop through the net-stats table, selecting all interfaces of "test-vm" and outputting only the needed columns
for row in $(net-stats -l | grep "test-vm" | awk '{print $1","$4","$6","$5}');
do
# Get all the parameters into variables
vmName=$(echo $row|awk -F"," '{print $3}');
portNum=$(echo $row|awk -F"," '{print $1}');
switchName=$(echo $row|awk -F"," '{print $2}');
macAddr=$(echo $row|awk -F"," '{print $4}');
# Get the data from counters using the vsish commands and parsing the output data
droppedRx=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/clientStats | grep "droppedRx:" | awk -F":" '{print $2}');
outOfBuf=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary | grep "running out of buffers:" | awk -F":" '{print $2}');
ring1Full=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary | grep "1st ring is full:" | awk -F":" '{print $2}');
ring2Full=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary | grep "2nd ring is full:" | awk -F":" '{print $2}');
burstDrop=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary | grep "dropped by burst queue:" | awk -F":" '{print $2}');
burstDelivered=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary | grep "delivered by burst queue:" | awk -F":" '{print $2}');
# Get the current time in UTC (the trailing Z in the format denotes UTC)
time_now=$(date -u '+%Y-%m-%dT%H:%M:%SZ');
# Append the parsed data to the CSV as one row
echo vsish,$time_now,$vmName,$portNum,$macAddr,$droppedRx,$outOfBuf,$ring1Full,$ring2Full,$burstDrop,$burstDelivered >> "$fullpath";
done
}
# Take the output file name and the VMFS volume name as parameters
output_file=$1
vmfs=$2
# Output the header row
echo measurement,time,vm,portnum,macaddress,droppedrx,runoutofbuf,1stringfull,2ndringfull,burstdrop,burstdelivered > /vmfs/volumes/$vmfs/buf_log/$output_file
# Run buffer_monitor function with whole output path as a parameter
buffer_monitor /vmfs/volumes/$vmfs/buf_log/$output_file;
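The script above runs vsish once per counter, so rxSummary is read five times for every interface. A possible variation is to read it once and pull all five counters out in a single awk pass. The sketch below is an illustration only: the function name parse_rx_summary is hypothetical, and the match strings are simply the ones the script greps for.

```shell
#!/bin/sh
# Hypothetical helper: read one rxSummary dump on stdin and print the five
# counters as a CSV fragment, in the order the script above writes them.
parse_rx_summary() {
    awk -F':' '
        # Strip whitespace around a counter value
        function val(s) { gsub(/[ \t\r]/, "", s); return s }
        /running out of buffers:/   { buf = val($2) }
        /1st ring is full:/         { r1  = val($2) }
        /2nd ring is full:/         { r2  = val($2) }
        /dropped by burst queue:/   { bd  = val($2) }
        /delivered by burst queue:/ { bv  = val($2) }
        END { print buf "," r1 "," r2 "," bd "," bv }
    '
}

# On an ESXi host, inside the loop of the script above:
#   rxSummary=$(vsish -e cat /net/portsets/$switchName/ports/$portNum/vmxnet3/rxSummary)
#   counters=$(printf '%s\n' "$rxSummary" | parse_rx_summary)
```

Reading the node once also means all five values come from the same snapshot, which matters when counters change quickly during a burst.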
Testing and running the script
[root@lab-esxi01:~] sh buffer_monitor.sh vsish_output.csv volume-name
[root@lab-esxi01:~] cat /vmfs/volumes/volume-name/buf_log/vsish_output.csv
measurement,time,vm,portnum,macaddress,droppedrx,runoutofbuf,1stringfull,2ndringfull,burstdrop,burstdelivered
vsish,2022-02-06T13:37:29Z,test-vm.eth3,50331758,00:50:56:ac:23:90,12301,25233,25233,0,0,10500
vsish,2022-02-06T13:37:29Z,test-vm.eth4,50331759,00:50:56:ac:23:a6,1233,24444,24444,0,0,1032
vsish,2022-02-06T13:37:29Z,test-vm.eth5,50331760,00:50:56:ac:43:a1,0,0,0,0,0,0
vsish,2022-02-06T13:37:29Z,test-vm.eth6,50331761,00:50:56:ac:52:2c,0,0,0,0,0,0
vsish,2022-02-06T13:37:29Z,test-vm.eth7,50331762,00:50:56:ac:63:9f,0,0,0,0,0,0
vsish,2022-02-06T13:37:29Z,test-vm.eth2,50331763,00:50:56:ac:55:7f,0,0,0,0,0,0
vsish,2022-02-06T13:37:29Z,test-vm.eth1,50331764,00:50:56:ac:67:42,0,0,0,0,0,0
vsish,2022-02-06T13:37:30Z,test-vm.eth0,50331765,00:50:56:ac:e3:76,0,0,0,0,0,0
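Before pointing Telegraf at the file, it can be worth checking that every row has the same number of fields as the header, since an empty or missing counter would shift the columns. A minimal sketch, with check_csv as a hypothetical helper name:

```shell
#!/bin/sh
# Hypothetical check: every data row must have as many comma-separated
# fields as the header row. Prints offending lines and returns non-zero
# if any row differs.
check_csv() {
    awk -F',' '
        NR == 1 { want = NF; next }
        NF != want {
            printf "line %d: %d fields, expected %d\n", NR, NF, want
            bad = 1
        }
        END { exit (bad ? 1 : 0) }
    ' "$1"
}

# On the ESXi host from this post:
#   check_csv /vmfs/volumes/volume-name/buf_log/vsish_output.csv && echo "CSV looks consistent"
```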
Automate the script in Cron
# Add a cron entry that runs the script at 1-minute intervals
[root@lab-esxi01:~] vi /var/spool/cron/crontabs/root
*/1 * * * * sh /buffer_monitor.sh vsish_output.csv volume-name
# Kill and restart the cron process
[root@lab-esxi01:~] esxcli system process list | grep cron
2426225 2426225 busybox superDom /usr/lib/vmware/busybox/bin/busybox crond
[root@lab-esxi01:~] kill 2426225
[root@lab-esxi01:~] /usr/lib/vmware/busybox/bin/busybox crond
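If you would rather script the crontab change than edit the file in vi, one option is to append the line only when it is not already present, so that repeated runs do not create duplicate entries. This is a sketch under assumptions: add_cron_entry is a hypothetical helper and not part of the original scripts.

```shell
#!/bin/sh
# Hypothetical helper: append a line to a crontab file only if the exact
# same line is not already there.
add_cron_entry() {
    crontab_file=$1
    entry=$2
    # -F: fixed string, -x: match the whole line, -q: no output
    if ! grep -qxF -e "$entry" "$crontab_file" 2>/dev/null; then
        printf '%s\n' "$entry" >> "$crontab_file"
    fi
}

# On the ESXi host, with $entry holding the crontab line shown above:
#   add_cron_entry /var/spool/cron/crontabs/root "$entry"
```

The cron process still has to be restarted afterwards, as in the steps above.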
Fetch the data via Telegraf
For this you must also create a read-only user that is allowed to fetch the file through the HTTPS datastore browser. You can find the correct URL by opening the file in a web browser first. The configuration below only fetches the data from the configured HTTPS URL; you still need to configure an output (for example InfluxDB) as you normally would.
# VSISH data fetch
[[inputs.http]]
urls = [
"https://ESXI-ADDRESS/folder/buf_log/vsish_output.csv?dcPath=ha%2ddatacenter&dsName=lab%252desxi01%252dvmfs"
]
interval = "1m"
# Use TLS but skip chain & host verification
insecure_skip_verify = true
# User and password of the read only account
username = "read_only_esxi"
password = "xxxxxx"
data_format = "csv"
tagexclude = ["host","url"]
csv_header_row_count = 1
csv_measurement_column = "measurement"
csv_timestamp_column = "time"
csv_timestamp_format = "2006-01-02T15:04:05Z07:00"
csv_tag_columns = ["vm", "portnum","macaddress"]
[inputs.http.tags]
tag1 = "vsish_monitoring"
Conclusion
Following the steps above lets you fetch the data with Telegraf (or another tool of your choice) and store it in the database of your choosing. That in turn makes it easy to graph the data, for example in Grafana.
All the scripts above can be found in the Git repository.