  OpenVZ / OVZ-7063

Active SSH Session to Container is Lost when Suspend/Resume Container

    Details

    • Type: Bug
    • Status: Open
    • Priority: Major
    • Resolution: Unresolved
    • Fix Version/s: Vz7.0-Update-next
    • Component/s: CRIU
    • Security Level: Public

      Description

      Description of problem:
      taken from
      https://forum.openvz.org/index.php?t=rview&goto=53447&th=13580#msg_53447

      Dear All,

      I am trying to install OpenVZ on a RedHat 7.5 machine (AWS instance). I have successfully installed it and its utilities (with yum mainly, openvz-release-7.0.8-4.vz7.x86_64) and I am able to start a container, suspend, and resume it. The container launched is Centos 7.

      [root@ip-172-31-7-139 ec2-user_scripts]# uname -r
      3.10.0-862.11.6.vz7.64.7

      When I launch a container "ct1", I can ssh to it and run something simple like the top command "top -d 1", which updates every second. If I suspend the container while I have that active ssh session to ct1, the shell with the top command obviously freezes. My expectation when I resume the container is that the shell and the top command should resume; instead, the ssh/tcp connection on that shell breaks and I end up getting kicked out of the container when it is resumed. I can open a new ssh connection to the container after it resumes with no problem. However, my expectation is that I shouldn't be losing the ssh connection in the first place; it should just hang during the suspend/resume and then continue normally once the resume is done. The error I get is "packet_write_wait: Connection to ct1 port 22: Broken pipe".

      I googled the error and tried to adjust some ssh configuration parameters on both the client and server side, but none helped. For example, I adjusted these (see the sketch after the list for where they go):
      - ServerAliveInterval
      - ServerAliveCountMax
      - ClientAliveInterval
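
      For reference, a minimal sketch of where these parameters live; the values below are only illustrative, not the exact ones I tried:

      # Client side, in ~/.ssh/config on the machine ssh'ing to ct1 (illustrative values):
      Host ct1
          ServerAliveInterval 30
          ServerAliveCountMax 120

      # Server side, in /etc/ssh/sshd_config inside ct1, then restart sshd (illustrative values):
      ClientAliveInterval 30
      ClientAliveCountMax 120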

      I also tried to adjust some kernel parameters on the hardware node and in the container, as I've seen suggestions online to do so, but that didn't help either. These are the ones I adjusted (applied as sketched after the list):
      net.ipv4.tcp_keepalive_time = 7200
      net.ipv4.tcp_keepalive_intvl = 300
      net.ipv4.tcp_keepalive_probes = 100
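
      For reference, these values can be applied with the standard sysctl mechanism on the hardware node and inside ct1, e.g.:

      sysctl -w net.ipv4.tcp_keepalive_time=7200
      sysctl -w net.ipv4.tcp_keepalive_intvl=300
      sysctl -w net.ipv4.tcp_keepalive_probes=100
      # or persist them in /etc/sysctl.conf and reload with "sysctl -p"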


      Are there any settings to adjust on the hardware node (e.g. kernel parameters) or when launching the container to fix this behavior? Any suggestions are really appreciated.

      This is the summary of the scenario I am describing (a fuller sketch with the setup spelled out follows the list):

      #Create container
      prlctl create ct1 --vmtype ct
      #setup network, dns, etc..
      #ssh to ct1 and run simple command like "top -d 1"
      #Suspend container while active ssh session is taking place
      prlctl suspend ct1
      #Resume container
      prlctl resume ct1
      #Expectation: ssh session should still be intact after resuming container, instead it gets broken
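
      For completeness, the same reproduction with the elided setup spelled out; the OS template name, IP, and nameserver below are placeholders, not the actual values I used:

      prlctl create ct1 --vmtype ct --ostemplate centos-7-x86_64   # template name is a placeholder
      prlctl set ct1 --ipadd 10.0.0.10 --nameserver 10.0.0.2       # placeholder network values
      prlctl start ct1
      # from another host: ssh to ct1 and run "top -d 1", then:
      prlctl suspend ct1
      prlctl resume ct1
      # expected: the ssh session survives; observed: "packet_write_wait: ... Broken pipe"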

      Thank you for your help.

      Mohamad Sindi
      Massachusetts Institute of Technology (MIT)

        Activity

        tomvb Tom VB added a comment -

        Note that this concerns CentOS 7.5 and not the OpenVZ installation.

        "I am trying to install OpenVZ on a RedHat 7.5 machine (AWS instance). I have successfully installed it and its utilities (with yum mainly, openvz-release-7.0.8-4.vz7.x86_64) and I am able to start a container, suspend, and resume it. The container launched is Centos 7."

        sindimo Mohamad S. added a comment -

        Thank you for the clarification and pointing me to the supported OpenVZ iso.

        I finally got an AWS instance installed with the latest OpenVZ release ISO. Unfortunately, the problem remains even after installing from the supported OpenVZ ISO. For reference, below are the versions in use after running "yum update" on the system:

        [ec2-user@openvz-node ~]$ cat /etc/redhat-release
        Virtuozzo Linux release 7.5

        [ec2-user@openvz-node ~]$ uname -r
        3.10.0-862.14.4.vz7.72.4

        [ec2-user@openvz-node ~]$ rpm -qa | egrep "openvz-release|criu|prlctl|prl-disp-service|vzkernel|ploop|python-subprocess32|yum-plugin-priorities|libprlsdk"

        criu-3.10.0.7-1.vz7.x86_64
        libprlsdk-7.0.220-6.vz7.x86_64
        libprlsdk-python-7.0.220-6.vz7.x86_64
        openvz-release-7.0.9-2.vz7.x86_64
        ploop-7.0.131-1.vz7.x86_64
        ploop-lib-7.0.131-1.vz7.x86_64
        prlctl-7.0.156-1.vz7.x86_64
        prl-disp-service-7.0.863-1.vz7.x86_64
        prl-disp-service-tests-7.0.863-1.vz7.x86_64
        python-criu-3.10.0.7-1.vz7.x86_64
        python-ploop-7.0.131-1.vz7.x86_64
        python-subprocess32-3.2.7-1.vz7.5.x86_64
        vzkernel-3.10.0-862.14.4.vz7.72.4.x86_64
        yum-plugin-priorities-1.1.31-46.vl7.noarch

        I tried to investigate this further and was able to figure out what's triggering the issue, but I'm not sure how to fix it.

        The container I am launching has an NFS4 mount inside it.

        If I disable that NFS mount and then suspend/resume the container, everything works fine: the active ssh sessions to the container resume once the resume operation completes.

        However, if I keep the NFS mount inside the container and suspend/resume it, any active ssh session to the container gets broken once the resume is done (broken pipe error). Please note that after the container is resumed, I am able to establish a new ssh session to it, and the NFS mount inside it is active, accessible, and has no issues. So the NFS mount itself survives the resume intact; it's just that the presence of an NFS mount inside the container seems to break the restore of active ssh sessions.
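
        For reference, the kind of mount that triggers it; the server name and export path below are placeholders for the real ones:

        # inside ct1 -- placeholders for the actual NFS server and export:
        mount -t nfs4 nfs-server.example.com:/export/data /mnt/data
        # with this mount present, suspend/resume breaks active ssh sessions;
        # after "umount /mnt/data", suspend/resume keeps them intact.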

        I hope this gives more insight for investigating the problem further; if you have any suggestions to get around this, I would truly appreciate your feedback.

        Many thanks for your help.

        Sincerely,

        Mohamad


          People

          • Assignee: khorenko (Konstantin Khorenko)
          • Reporter: vvs (Vasily Averin)
          • Votes: 0
          • Watchers: 3

            Dates

            • Created:
            • Updated: