OVZ-6813

Node almost dead on startup with 042stab120.3



      Description

      We have a problem with one server running the new 042stab120.3 kernel.

      Even with VE_PARALLEL=2, the server starts to hang: the console (both local tty and ssh sessions) becomes unresponsive, even to a simple Enter keypress.
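
      For reference, this is how the parallel-start limit is configured on the node (a sketch excerpt of /etc/vz/vz.conf, not the full config):

      # /etc/vz/vz.conf (excerpt) -- limits how many CTs start in parallel at boot
      VE_PARALLEL=2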

      I've noticed that the stall becomes much worse when anything netlink-related runs in the containers that are just booting. Typically, ip link set up dev lo and ip a add for the venet0 interface send the whole HW node into a long stall.
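
      A rough sketch of the trigger (the CT ID 101 and the address are placeholders), run right after a container starts:

      # both commands go through netlink inside the just-started CT
      vzctl exec 101 ip link set up dev lo
      vzctl exec 101 ip addr add 10.0.0.101/32 dev venet0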

      We were unable to start more than 3-4 containers on that node with this kernel.

      Interestingly enough, there wasn't any problem with our 17 other HW nodes (~1200 CTs). I've checked twice that these nodes don't differ in either SW or HW configuration. The only odd thing on the other nodes is that iptables rules load much more slowly (though I don't have concrete numbers at hand).

      The only thing in the dmesg buffer was a vzctl hung-task report; I don't think it points to much:

      [ 601.797644] INFO: task vzctl:24089 blocked for more than 120 seconds.
      [ 601.797830] Tainted: P -- ------------ 2.6.32-042stab120.3 #1
      [ 601.798149] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
      [ 601.798468] vzctl D ffff88407115b3c0 0 24089 23970 0 0x00000080
      [ 601.798472] ffff88400020be08 0000000000000082 ffff88400020bdd0 ffff88207fc04080
      [ 601.798475] ffff883ffd439780 000000000112a9fb 0000006df3c32a7c ffff884073736410
      [ 601.798479] ffff882000000000 ffff884073736400 0000000100029e26 00000013eef38227
      [ 601.798484] Call Trace:
      [ 601.798489] [<ffffffff81556016>] __mutex_lock_slowpath+0x96/0x210
      [ 601.798494] [<ffffffff81555b3b>] mutex_lock+0x2b/0x50
      [ 601.798498] [<ffffffff810ef9ed>] cgroup_kernel_open+0x3d/0x120
      [ 601.798503] [<ffffffff810c50d2>] ub_cgroup_init+0x82/0xb0
      [ 601.798508] [<ffffffff810c6605>] ? alloc_ub+0xa5/0x100
      [ 601.798513] [<ffffffff810c6804>] get_beancounter_byuid+0x114/0x270
      [ 601.798519] [<ffffffff810c4f4c>] sys_setluid+0x6c/0xa0
      [ 601.798523] [<ffffffff8100b1a2>] system_call_fastpath+0x16/0x1b

      For now, we've had to roll back to a vulnerable kernel.

      I'm able to reproduce this at will.
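
      Roughly, the reproducer is just a parallel start of a few CTs (the IDs are placeholders); the node stalls before the loop finishes:

      # start a few containers back to back and watch the console
      for ctid in 101 102 103 104 105; do
          vzctl start $ctid &
      done
      wait
      dmesg | grep 'blocked for more than'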

      What can I do to help you debug this?


      People

      Assignee: Vasily Averin (vvs)
      Reporter: Pavel Snajdr (snajpa)
      Votes: 0
      Watchers: 5
