  1. OpenVZ
  2. OVZ-4888

UBCs contain CTIDs that are deleted unmounted down, causing vzmemcheck to return incorrect data

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major
    • Resolution: Fixed
    • Fix Version/s: OpenVZ-legacy
    • Component/s: Containers::Userspace
    • Security Level: Public
    • Environment:
      Operating System: RHEL/CentOS 5
      Platform: x86_64 (AMD64)

      Description

      Calculating utilisation, commitment & limit indices for our OpenVZ HNs gives inaccurate results because vzmemcheck includes containers that are deleted:

      pdzwart@atlassian45:~/OpenVZ[23:16:08](0,0)$ sudo /usr/sbin/vzlist
      Container(s) not found
      pdzwart@atlassian45:~/OpenVZ[23:16:32](0,1)$ sudo /usr/sbin/vzmemcheck -v
      Output values in %
      veid        LowMem  LowMem     RAM MemSwap MemSwap   Alloc   Alloc   Alloc
                    util  commit    util    util  commit    util  commit   limit
      1031          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1005          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1010          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1017          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1016          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1015          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1014          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1013          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1012          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1009          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      1008          0.00   13.13    0.00    0.00   12.70    0.00   12.70   57.78
      -------------------------------------------------------------------------
      Summary:      0.00  144.38    0.00    0.00  139.75    0.00  139.75  635.53
      pdzwart@atlassian45:~/OpenVZ[23:16:35](0,0)$ sudo /usr/sbin/vzctl status 1031
      CTID 1031 deleted unmounted down
      pdzwart@atlassian45:~/OpenVZ[23:21:26](0,0)$

      There are also entries in both /proc/user_beancounters and /proc/bc/${CTID}/ that show allocations held:

      pdzwart@atlassian45:~/OpenVZ[23:24:56](0,0)$ export CTID=1010
      pdzwart@atlassian45:~/OpenVZ[23:25:44](0,0)$ sudo egrep -A23 "${CTID}:" /proc/user_beancounters
           1010: kmemsize 45366 20936957 841784627 925963089 0
                  lockedpages 0 0 41102 41102 0
                  privvmpages 0 1438840 4476066 4923672 0
                  shmpages 0 3872 447606 447606 0
                  dummy 0 0 0 0 0
                  numproc 0 321 20550 20550 0
                  physpages 0 685472 0 9223372036854775807 0
                  vmguarpages 0 0 746011 9223372036854775807 0
                  oomguarpages 0 685472 746011 9223372036854775807 0
                  numtcpsock 0 430 20550 20550 0
                  numflock 0 10 1000 1100 0
                  numpty 0 6 512 512 0
                  numsiginfo 0 11 1024 1024 0
                  tcpsndbuf 156608 30611240 196422075 280594875 0
                  tcprcvbuf 0 117644640 196422075 280594875 0
                  othersockbuf 0 295872 98211037 182383837 0
                  dgramrcvbuf 0 28400 98211037 98211037 0
                  numothersock 0 41 20550 20550 0
                  dcachesize 3756 62913 183872621 189388800 0
                  numfile 0 7046 328800 328800 0
                  dummy 0 0 0 0 0
                  dummy 0 0 0 0 0
                  dummy 0 0 0 0 0
                  numiptent 0 14 200 200 0
      pdzwart@atlassian45:~/OpenVZ[23:25:48](0,0)$ sudo cat /proc/bc/${CTID}/resources
                  kmemsize 45366 20936957 841784627 925963089 0
                  lockedpages 0 0 41102 41102 0
                  privvmpages 0 1438840 4476066 4923672 0
                  shmpages 0 3872 447606 447606 0
                  numproc 0 321 20550 20550 0
                  physpages 0 685472 0 9223372036854775807 0
                  vmguarpages 0 0 746011 9223372036854775807 0
                  oomguarpages 0 685472 746011 9223372036854775807 0
                  numtcpsock 0 430 20550 20550 0
                  numflock 0 10 1000 1100 0
                  numpty 0 6 512 512 0
                  numsiginfo 0 11 1024 1024 0
                  tcpsndbuf 156608 30611240 196422075 280594875 0
                  tcprcvbuf 0 117644640 196422075 280594875 0
                  othersockbuf 0 295872 98211037 182383837 0
                  dgramrcvbuf 0 28400 98211037 98211037 0
                  numothersock 0 41 20550 20550 0
                  dcachesize 3756 62913 183872621 189388800 0
                  numfile 0 7046 328800 328800 0
                  numiptent 0 14 200 200 0
                  swappages 0 0 9223372036854775807 9223372036854775807 0
      pdzwart@atlassian45:~/OpenVZ[23:25:58](0,0)$ sudo /usr/sbin/vzctl status ${CTID}
      CTID 1010 deleted unmounted down
      pdzwart@atlassian45:~/OpenVZ[23:26:04](0,0)$

      Package versions are as follows:

      pdzwart@atlassian45:~/OpenVZ[23:26:26](0,1)$ rpm -qa |grep vz
      vzctl-lib-3.0.24.1-1
      vzpkg-2.7.0-18
      vztmpl-fedora-9-1.1-1
      vzrpm44-4.4.1-22.5
      vztmpl-fedora-core-3-2.0-2
      vzctl-3.0.24.1-1
      vzrpm43-python-4.3.3-7_nonptl.6
      vztmpl-centos-5-2.0-3
      vztmpl-fedora-core-6-1.2-1
      vzquota-3.0.12-1
      vzrpm44-python-4.4.1-22.5
      vztmpl-centos-4-2.0-2
      vztmpl-fedora-core-5-2.0-2
      vzrpm43-4.3.3-7_nonptl.6
      vztmpl-fedora-core-4-2.0-2
      ovzkernel-2.6.18-194.8.1.el5.028stab070.2
      vzyum-2.4.0-11
      vztmpl-fedora-7-1.1-1
      pdzwart@atlassian45:~/OpenVZ[23:26:36](0,0)$

      Our workaround is a Python script that parses /proc/user_beancounters and keeps only the entries for containers that vzlist reports as running.
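
      The reporter's script is not included in this issue. The following is only a minimal sketch of the idea, assuming the usual /proc/user_beancounters layout (a "Version:" line, a header line starting with "uid", then one "CTID:" prefix per container followed by resource lines). The function names are hypothetical, and the vzlist text is assumed to be a plain list of CTIDs, for example from something like `vzlist -H -o ctid`.

```python
from typing import Dict, Set

FIELDS = ("held", "maxheld", "barrier", "limit", "failcnt")

Resources = Dict[str, Dict[str, int]]


def parse_user_beancounters(text: str) -> Dict[int, Resources]:
    """Parse the text of /proc/user_beancounters into {ctid: {resource: values}}."""
    result: Dict[int, Resources] = {}
    current = None
    for line in text.splitlines():
        tokens = line.split()
        # Skip blank lines, the "Version:" line and the column header line.
        if not tokens or tokens[0] in ("Version:", "uid"):
            continue
        # A token such as "1010:" starts the block for a new container.
        if tokens[0].endswith(":") and tokens[0][:-1].isdigit():
            current = int(tokens[0][:-1])
            result[current] = {}
            tokens = tokens[1:]
        # A resource line is a name followed by five integers; skip "dummy" rows.
        if current is None or len(tokens) != 6 or tokens[0] == "dummy":
            continue
        values = [int(v) for v in tokens[1:]]
        result[current][tokens[0]] = dict(zip(FIELDS, values))
    return result


def parse_running_ctids(vzlist_output: str) -> Set[int]:
    """Collect the numeric CTIDs from vzlist output (assumed: one CTID per line)."""
    return {int(tok) for tok in vzlist_output.split() if tok.isdigit()}


def running_only(ubc: Dict[int, Resources], running: Set[int]) -> Dict[int, Resources]:
    """Drop beancounter entries whose CTID is not in the running set.

    Note: an entry for uid 0, if present, is dropped too unless 0 is in `running`.
    """
    return {ctid: res for ctid, res in ubc.items() if ctid in running}
```

      With the vzlist output shown above ("Container(s) not found"), `parse_running_ctids` yields an empty set, so `running_only` would discard every stale entry such as 1010 and 1031.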

        Activity

        kir Kir Kolyshkin added a comment -

        Pete,

        Thanks for reporting! Which kernel are you using?

        Note that held != 0 in the beancounters of a stopped container means leak(s). This is most probably a kernel bug, so I am reassigning to kernel.

        From the tools' point of view, it may make sense to exclude the beancounters of stopped containers, but this is usually not an issue, because beancounters for stopped containers drop to zero and are then removed.
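
        As an illustration of the held != 0 check, here is a minimal Python sketch, not part of vzctl. It assumes input in the format of /proc/bc/${CTID}/resources as pasted in the description (resource name followed by held, maxheld, barrier, limit, failcnt). The function name is hypothetical.

```python
from typing import Dict


def held_resources(resources_text: str) -> Dict[str, int]:
    """Return {resource: held} for every resource whose held value is non-zero."""
    held_nonzero: Dict[str, int] = {}
    for line in resources_text.splitlines():
        tokens = line.split()
        # Expect: name held maxheld barrier limit failcnt
        if len(tokens) != 6 or not tokens[1].isdigit():
            continue
        held = int(tokens[1])
        if held != 0:
            held_nonzero[tokens[0]] = held
    return held_nonzero
```

        Applied to the resources listing for CTID 1010 in the description, this would report kmemsize, tcpsndbuf and dcachesize as still held.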

        pdzwart@atlassian.com Pete de Zwart added a comment -

        From /var/log/dmesg:

        Linux version 2.6.18-164.15.1.el5.028stab068.9 (root@rhel5-build-x64) (gcc version 4.1.2 20080704 (Red Hat 4.1.2-46)) #1 SMP Tue Mar 30 18:07:38 MSD 2010

        From /boot/grub/menu.lst:

        title Red Hat Enterprise Linux Server (2.6.18-164.15.1.el5.028stab068.9)
        root (hd0,0)
        kernel /vmlinuz-2.6.18-164.15.1.el5.028stab068.9 ro root=LABEL=/ rhgb quiet elevator=deadline selinux=0
        initrd /initrd-2.6.18-164.15.1.el5.028stab068.9.img

        kir Kir Kolyshkin added a comment -

        Pete,

        This kernel, 028stab068.9, was released at the end of March 2010, i.e. about 10 months ago. It is old and buggy. I suggest you update to the latest rhel5 stable kernel (http://wiki.openvz.org/Download/kernel/rhel5) and give it a try. If you still see UBC leaks with the latest kernel, please file a separate bug for that.

        As for the tools, we do indeed need to filter the beancounters so that only running CTs are counted, so I am reassigning the bug back to vzctl. I will work on that.

        pdzwart@atlassian.com Pete de Zwart added a comment -

        Kir, we'll upgrade to the latest stable and advise if it resolves the issue with UBCs for deleted unmounted down containers.

        kir Kir Kolyshkin added a comment -

        It makes sense to upgrade your kernels from time to time anyway – fixed bugs and security holes, better performance, etc. Please do so. If you don't want to stop all CTs, use live migration (migrate out/reboot/migrate back).

        As for the bug itself, I have fixed it. The following Git commits are relevant:
        http://git.openvz.org/?p=vzctl;a=commit;h=24cc0e560700d568175112ea4cc4895c91f26504
        http://git.openvz.org/?p=vzctl;a=commit;h=0517647e2e91c934142b8f59112e2ca4df762334
        http://git.openvz.org/?p=vzctl;a=commit;h=d94974e1f9be2dfae11cee75fdb67dd44066cf9b

        Plus two cosmetic fixes to vzmemcheck while we're at it:
        http://git.openvz.org/?p=vzctl;a=commit;h=2ffb55af838c4c8989cbf0bccf480dcaf55d5f19
        http://git.openvz.org/?p=vzctl;a=commit;h=383e2756bc406e85d7f1d5e0024062f19599285e

        Fix will appear in vzctl >= 3.0.26 (which I hope to release some time this month, better sooner than later).

        sergeyb Sergey Bronnikov added a comment -

        The bug was fixed more than a year ago and there have been no complaints from the reporter since the fix. We believe the fix helped and are marking the bug as closed.


          People

          • Assignee:
            kir Kir Kolyshkin
            Reporter:
            pdzwart@atlassian.com Pete de Zwart