Hi,
If your Ceph cluster encounters a slow or blocked operation, it logs the event and sets the cluster health to HEALTH_WARN. As far as Ceph is concerned, “slow ops” and “blocked ops” mean the same thing.
Generally speaking, an OSD has slow requests when it cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter, which defaults to 30 seconds.
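If you want to inspect or temporarily relax that threshold, a minimal sketch (assuming a Nautilus-or-later cluster with the centralized config store):

ceph config get osd osd_op_complaint_time      # show the current threshold, in seconds
ceph config set osd osd_op_complaint_time 60   # example: relax it to 60 seconds

Raising the threshold only hides the symptom, so treat it as a diagnostic aid rather than a fix.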
The main causes of OSDs having slow requests are:
- Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
- Problems with the network, which usually show up as flapping OSDs (OSDs repeatedly reported down and then back up; see the log check just after this list)
- System load
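One quick way to spot flapping OSDs is to search the OSD logs for the “wrongly marked me down” message (paths below assume the default /var/log/ceph layout):

grep -H "wrongly marked me down" /var/log/ceph/ceph-osd.*.log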
Start troubleshooting in this order:
- Look in the monitor logs (systemctl status ceph-mon@&lt;hostname&gt;.service)
- Look in the OSD logs (systemctl status ceph-osd@&lt;id&gt;.service)
- Check Disk Health (SMART)
- Check Network Health (Network diagnostic tools)
- Check the details at /var/log/syslog
- Check OSD status with “ceph daemon osd.xx status” and “ceph daemon osd.xx ops” (a sketch of these checks follows this list)
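As a sketch, the last few checks can look like this; the device name, peer address, and OSD id are placeholders, and the “ceph daemon” commands must run on the host that carries that OSD:

# Disk health via SMART (replace /dev/sdb with the OSD's backing device)
smartctl -a /dev/sdb

# Basic network health toward a peer OSD host ("iperf3 -s" must be running on the far end)
ping -c 5 10.111.111.53
iperf3 -c 10.111.111.53

# Per-OSD status and the operations currently in flight
ceph daemon osd.10 status
ceph daemon osd.10 ops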
Example
Cluster shows health warning:
root@node1-neu:~# ceph -s
  cluster:
    id:     c2063d70-6a16-4edb-a486-22f46450a5ac
    health: HEALTH_WARN
            4 slow ops, oldest one blocked for 4364 sec, mon.node1-neu has slow ops
...
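For a per-daemon breakdown of the warning, “ceph health detail” is a safe next step:

ceph health detail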
This gives us a clue where to look first: the monitor service running on the host named “node1-neu”.
Check the status of the monitor service on host node1-neu:
root@node1-neu:~# systemctl status ceph-mon@node1-neu.service
● ceph-mon@node1-neu.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-07-09 19:44:51 UTC; 18min ago
 Main PID: 748840 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node1-neu.service
           └─748840 /usr/bin/ceph-mon -f --cluster ceph --id node1-neu --setuser ceph --setgroup ceph

Jul 9 19:40:56 node1-neu ceph-mon[2983561]: 2020-07-09 19:40:56.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:01 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:01.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:06 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:06.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:11 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:11.894 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:16 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:16.894 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
It looks like mon.node1-neu got hung up reporting a failed OSD (osd.10 in the log above). Verify that all OSDs are up and in before touching the monitor.
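A quick way to check, using standard status commands:

ceph osd stat   # summary, e.g. "N osds: N up, N in" when everything is healthy
ceph osd tree   # per-host view; check the state of osd.10 from the log above

Once the OSDs check out, restart the mon.node1-neu service to clear the stale warning: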
systemctl restart ceph-mon@node1-neu.service
Verify that the service restarted correctly:
root@node1-neu:~# systemctl status ceph-mon@node1-neu.service
● ceph-mon@node1-neu.service - Ceph cluster monitor daemon
   Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
   Active: active (running) since Thu 2020-07-09 19:44:51 UTC; 18min ago
 Main PID: 748840 (ceph-mon)
   CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node1-neu.service
           └─748840 /usr/bin/ceph-mon -f --cluster ceph --id node1-neu --setuser ceph --setgroup ceph

Jul 09 19:44:51 node1-neu systemd[1]: Started Ceph cluster monitor daemon.
Jul 09 19:44:54 node1-neu ceph-mon[748840]: 2020-07-09 19:44:54.177 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial '' but no model
Jul 09 19:45:00 node1-neu ceph-mon[748840]: 2020-07-09 19:45:00.365 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial '' but no model
Jul 09 19:45:05 node1-neu ceph-mon[748840]: 2020-07-09 19:45:05.421 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial '' but no model
Jul 09 19:45:10 node1-neu ceph-mon[748840]: 2020-07-09 19:45:10.509 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial '' but no model
Jul 09 19:45:15 node1-neu ceph-mon[748840]: 2020-07-09 19:45:15.550 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial '' but no model
The “failed to get devid” lines come from the monitor's device-health telemetry and are unrelated to the slow ops. Finally, verify that your cluster is back to HEALTH_OK:
root@node1-neu:~# ceph -s
  cluster:
    id:     c2063d70-6a16-4edb-a486-22f46450a5ac
    health: HEALTH_OK
Hope this helps