How to fix the Ceph error state “oldest is osd_failure(failed timeout osd.xx)”

Hi,

If your Ceph cluster encounters a slow or blocked operation, it logs the event and puts the cluster health into HEALTH_WARN.

As far as Ceph is concerned, “slow” and “blocked” ops are the same thing; the two terms are used interchangeably.

Generally speaking, an OSD with slow requests is any OSD that cannot service the I/O operations in its queue within the time defined by the osd_op_complaint_time parameter. By default, this parameter is set to 30 seconds.
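
If you want to confirm the current threshold, or relax it temporarily while you investigate, the commands below are a minimal sketch using the centralized config database (Nautilus and newer; the 45-second value is only an illustration):

# Show the current complaint threshold (defaults to 30 seconds)
ceph config get osd osd_op_complaint_time

# Raise it temporarily, e.g. while a known-slow disk is being replaced
ceph config set osd osd_op_complaint_time 45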

The main causes of OSDs having slow requests are:

  • Problems with the underlying hardware, such as disk drives, hosts, racks, or network switches
  • Problems with the network, which usually show up as flapping OSDs (see the log checks after this list)
  • System load
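
Flapping OSDs usually leave a trail in the logs. A quick way to check, assuming the default /var/log/ceph paths:

# Repeated "marked down" / "boot" events for the same OSD suggest flapping
grep -E "marked down|boot" /var/log/ceph/ceph.log | tail -n 20

# An OSD that believes it was failed incorrectly logs this on its own host
grep "wrongly marked me down" /var/log/ceph/ceph-osd.*.log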

Start to troubleshoot in this order (example commands for each step follow the list):

  1. Look in the monitor logs (systemctl status ceph-mon@<hostname>.service)
  2. Look in the OSD logs (systemctl status ceph-osd@<id>.service)
    1. Check disk health (SMART)
    2. Check network health (network diagnostic tools)
    3. Check the details in /var/log/syslog
    4. Check OSD status with “ceph daemon osd.xx status” and “ceph daemon osd.xx ops”
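
Some example commands for each step, using the names from the case below (monitor host node1-neu, osd.10, peer address 10.111.111.53); /dev/sdX is a placeholder for the OSD's backing device, so adjust everything for your cluster:

# Steps 1 and 2: daemon logs (journald keeps the same output shown by systemctl status)
journalctl -u ceph-mon@node1-neu --since "1 hour ago"
journalctl -u ceph-osd@10 --since "1 hour ago"

# 2.1 Disk health
smartctl -a /dev/sdX

# 2.2 Network health (iperf3 needs "iperf3 -s" running on the far side)
ping -c 5 10.111.111.53
iperf3 -c 10.111.111.53

# 2.3 System log
grep -iE "ceph|error" /var/log/syslog | tail -n 50

# 2.4 OSD status and in-flight ops via the admin socket (run on the host that owns osd.10)
ceph daemon osd.10 status
ceph daemon osd.10 ops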

Example

Cluster shows health warning:

root@node1-neu:~# ceph -s
cluster:
id: c2063d70-6a16-4edb-a486-22f46450a5ac
health: HEALTH_WARN
4 slow ops, oldest one blocked for 4364 sec, mon.node1-neu has slow ops
...

This gives us a clue about where to look first: the monitor service running on the host named “node1-neu”.
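
If the summary is not specific enough, “ceph health detail” expands the warning into its individual health checks and names the daemons involved (the exact wording varies a bit between releases):

ceph health detail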

Check the status of the monitor service on host node1-neu

root@node1-neu:~# systemctl status ceph-mon@node1-neu.service
ceph-mon@node1-neu.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-07-09 19:44:51 UTC; 18min ago
Main PID: 748840 (ceph-mon)
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node1-neu.service
└─748840 /usr/bin/ceph-mon -f --cluster ceph --id node1-neu --setuser ceph --setgroup ceph

Jul 9 19:40:56 node1-neu ceph-mon[2983561]: 2020-07-09 19:40:56.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:01 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:01.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:06 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:06.890 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:11 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:11.894 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)
Jul 9 19:41:16 node1-neu ceph-mon[2983561]: 2020-07-09 19:41:16.894 7fce60718700 -1 mon.node1-neu@0(leader) e1 get_health_metrics reporting 4 slow ops, oldest is osd_failure(failed timeout osd.10 [v2:10.111.111.53:6808/20526,v1:10.111.111.53:6809/20526] for 28sec e3935 v3935)

It looks like mon.node1-neu got hung up processing an OSD failure report (failed timeout osd.10). Before clearing the warning, verify that all the OSDs are up and working.
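
A few quick checks, as a sketch: the first three commands assume osd.10 from the monitor log and a node with an admin keyring, and the last one has to run on the host that owns osd.10:

# All OSDs should be reported as up and in
ceph osd stat

# Lists only OSDs that are currently down; empty output is what you want
ceph osd tree down

# Locate the OSD named in the slow-op message and check its service
ceph osd find 10
systemctl status ceph-osd@10

Once everything checks out, restart the mon.node1-neu service to clear the stale warning: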

systemctl restart ceph-mon@node1-neu.service

Verify the service restarted correctly

root@node1-neu:~# systemctl status ceph-mon@node1-neu.service
ceph-mon@node1-neu.service - Ceph cluster monitor daemon
Loaded: loaded (/lib/systemd/system/ceph-mon@.service; enabled; vendor preset: enabled)
Active: active (running) since Thu 2020-07-09 19:44:51 UTC; 18min ago
Main PID: 748840 (ceph-mon)
CGroup: /system.slice/system-ceph\x2dmon.slice/ceph-mon@node1-neu.service
└─748840 /usr/bin/ceph-mon -f --cluster ceph --id node1-neu --setuser ceph --setgroup ceph

Jul 09 19:44:51 node1-neu systemd[1]: Started Ceph cluster monitor daemon.
Jul 09 19:44:54 node1-neu ceph-mon[748840]: 2020-07-09 19:44:54.177 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial ''but no model
Jul 09 19:45:00 node1-neu ceph-mon[748840]: 2020-07-09 19:45:00.365 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial ''but no model
Jul 09 19:45:05 node1-neu ceph-mon[748840]: 2020-07-09 19:45:05.421 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial ''but no model
Jul 09 19:45:10 node1-neu ceph-mon[748840]: 2020-07-09 19:45:10.509 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial ''but no model
Jul 09 19:45:15 node1-neu ceph-mon[748840]: 2020-07-09 19:45:15.550 7f1f66b34700 -1 mon.node1-neu@0(electing) e1 failed to get devid for : fallback method has serial ''but no model

And verify that your cluster is back to a healthy status (HEALTH_OK):

root@node1-neu:~# ceph -s
cluster:
id: c2063d70-6a16-4edb-a486-22f46450a5ac
health: HEALTH_OK
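
If you want to keep watching for a while to make sure the slow-op warning does not come back, you can follow the cluster log live (Ctrl+C to stop):

ceph -w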

Hope this helps
