VMware has some really nice flings on flings.vmware.com. There are some rumors about the viability of these projects under the Broadcom flag, but for now they still exist :).
The vSAN Performance Monitor fling provides some very cool Grafana dashboards which give you great insight into the performance of your vSAN cluster.
After deploying the OVA you'll get a VM running Photon OS with 3 containers, each with its own function.
The installation of this fling is very well documented and there are several blog posts on the internet to help you out, so there is no need to expand on that.
But after starting the containers and looking at the logs, I saw that metrics were being dropped due to a full buffer.
#Finding out which containers are running
root@am4-vsp-m-01 [ ~ ]# docker ps
CONTAINER ID IMAGE ..
f29bd9f71f8b grafana/grafana:6.1.6 ..
543718acd157 vsananalytics/telegraf-vsan:0.0.7 ..
8d7523e6858a influxdb:1.7 ..
#getting logs from telegraf container
root@am4-vsp-m-01 [ ~ ]# docker logs 543718acd157
2023-10-02T12:20:12Z D! [outputs.influxdb] Buffer fullness: 10000 / 10000 metrics
2023-10-02T12:20:12Z W! [outputs.influxdb] Metric buffer overflow; 3552 metrics have been dropped
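To get a feel for how much is being lost in total, the overflow warnings can be added up. The snippet below is a minimal sketch, assuming the warning lines keep the exact wording shown above; `sum_dropped` is a helper name made up for this post and is not part of the fling.

```shell
# Hypothetical helper: sum the dropped-metric counts from Telegraf's
# "Metric buffer overflow" warnings. Reads log lines on stdin.
sum_dropped() {
  awk '/Metric buffer overflow;/ {
         # the count is the field right after "overflow;"
         for (i = 1; i <= NF; i++)
           if ($i == "overflow;") total += $(i + 1)
       }
       END { print total + 0 }'
}
```

Usage, with the container ID from the docker ps output above (the `2>&1` is there on the assumption that Telegraf writes its log to stderr): `docker logs 543718acd157 2>&1 | sum_dropped`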
Although there was data visible in Grafana, I wanted to be sure that no metrics were being dropped. To achieve this we had to increase the metric buffer size.
root@am4-vsp-m-01 [ ~ ]# vi telegraf.conf
#at the bottom of the config file there is an agent section. Add the new metric buffer limit there and save with :wq!
debug = true
quiet = false
metric_buffer_limit = 40000
#After changing the buffer limit you need to rebuild/restart/compose the containers.
root@am4-vsp-m-01 [ ~ ]# docker-compose down
root@am4-vsp-m-01 [ ~ ]# docker-compose up -d
#check with docker logs (it takes around 5 minutes before the metrics start logging)
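Once the metrics start flowing again, the "Buffer fullness" debug lines show how close the buffer gets to the new limit. The following is a rough sketch that pulls the peak value out of the log. It assumes `debug = true` is still set in the agent section (otherwise those D! lines are not logged) and that the line format matches the log output shown earlier; `peak_fullness` is a made-up helper name.

```shell
# Hypothetical helper: report the highest "Buffer fullness" value seen
# in a Telegraf log read from stdin, as "used / limit (percent)".
peak_fullness() {
  awk '/Buffer fullness:/ {
         for (i = 1; i <= NF; i++)
           if ($i == "fullness:") { used = $(i + 1) + 0; limit = $(i + 3) + 0 }
         # remember the line with the highest fill level
         if (!seen || used > peak) { peak = used; cap = limit; seen = 1 }
       }
       END {
         if (seen && cap > 0) printf "%d / %d (%d%%)\n", peak, cap, peak * 100 / cap
         else print "no buffer fullness lines found"
       }'
}
```

Usage: `docker logs <telegraf container id> 2>&1 | peak_fullness`. The container ID changes after docker-compose down/up, so look it up again with docker ps. A peak that stays well below 100% suggests the new limit is large enough.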
Two things to keep in mind with the metric_buffer_limit.
- If you increase this buffer the VM is going to use more memory, so check with top whether you have enough memory left. (I set the memory to 4GB on the VM)
- I found my optimal metric_buffer_limit by testing, so you may need to adjust it a few times to find the right balance between memory usage and buffer limit.
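As an alternative to reading top by eye, the available memory can also be read from /proc/meminfo, which makes it easy to compare before and after changing the limit. This is only a sketch, assuming a Linux kernel that exposes the MemAvailable field; `mem_available_mib` is a hypothetical helper name.

```shell
# Hypothetical helper: print available memory in MiB.
# Reads /proc/meminfo by default; another file can be passed for testing.
mem_available_mib() {
  awk '$1 == "MemAvailable:" { printf "%d\n", $2 / 1024 }' "${1:-/proc/meminfo}"
}
```

Usage: run `mem_available_mib` on the VM before and after raising metric_buffer_limit, once the buffer has had time to fill.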