Details
Type: Bug
Status: Open
Priority: Major
Resolution: Unresolved
Fix Version/s: Vz7.0-Update-next
Component/s: Containers::Kernel
Security Level: Public
Environment: 3.10.0-327.3.1.vz7.10.15
ploop-7.0.21-1.vz7.x86_64
vzctl-7.0.87-1.vz7.x86_64
Description
>Description of problem:
A process inside a container occasionally hangs (enters the D state, i.e. uninterruptible sleep). At the same time, a backtrace appears in dmesg.
>How reproducible:
I can reproduce it by running a Java Minecraft server. Many other processes (mysql, apt-get, nginx, etc.) have hit the same problem, but the Minecraft server is the only way I have found to reproduce it reliably (possibly because of its particular I/O workload).
>Steps to Reproduce:
1. vzctl start 101
2. vzctl enter 101
3. apt-get install openjdk-7-jre
4. java -jar server.jar
5. Connect to the Minecraft server and play on it.
Both ploop-based and simfs-based containers reproduce the problem.
With ploop, a process entering the D state causes the entire container to hang.
With simfs, the stuck process does not affect other processes.
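For anyone trying to confirm the same symptom, here is a minimal sketch for listing processes in the D state. The helper name filter_d_state is hypothetical and not part of vzctl or ploop; it only keeps the header row and the rows whose STAT column starts with "D".

```shell
# Hypothetical helper: reads `ps` output on stdin and keeps the header
# line plus every row whose second column (STAT) starts with "D".
filter_d_state() {
    awk 'NR == 1 || $2 ~ /^D/'
}
```

On the host this could be used as `ps -eo pid,stat,wchan:32,comm | filter_d_state`, assuming the procps version of ps; the WCHAN column then shows the kernel function the process is sleeping in.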
backtrace:
Apr 22 00:49:46 scgyshell-1 kernel: INFO: task java:177716 blocked for more than 120 seconds.
Apr 22 00:49:46 scgyshell-1 kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
Apr 22 00:49:46 scgyshell-1 kernel: java D ffff880111448000 0 177716 169372 7 0x00000004
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7af0 0000000000000086 ffff880111448000 ffff88010ddd7fd8
Apr 22 00:49:46 scgyshell-1 kernel: ffff88010ddd7fd8 ffff88010ddd7fd8 0000000000000007 ffff8804495d6580
Apr 22 00:49:46 scgyshell-1 kernel: ffff88045fcdec40 0000000000000000 7fffffffffffffff ffffffff8117ae20
Apr 22 00:49:46 scgyshell-1 kernel: Call Trace:
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81635d69>] schedule+0x29/0x70
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633b39>] schedule_timeout+0x239/0x2d0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0470e5c>] ? __ext4_journal_stop+0x3c/0xb0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043fcc9>] ? ext4_da_write_end+0x139/0x2e0 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae20>] ? wait_on_page_read+0x60/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8163544e>] io_schedule_timeout+0xae/0x130
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff816354e8>] io_schedule+0x18/0x1a
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117ae2e>] sleep_on_page+0xe/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81633c90>] __wait_on_bit+0x60/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8117abb6>] wait_on_page_bit+0x86/0xb0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810a8790>] ? wake_atomic_t_function+0x40/0x40
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118c80b>] truncate_inode_pages_range+0x3bb/0x740
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff810b2c04>] ? __wake_up+0x44/0x50
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa017325a>] ? jbd2_journal_stop+0x1ea/0x3d0 [jbd2]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa0449f94>] ? ext4_unlink+0x304/0x390 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8125ddda>] ? __dquot_initialize+0x3a/0x1c0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8118cc0e>] truncate_inode_pages_final+0x5e/0x90
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffffa043f40c>] ext4_evict_inode+0x10c/0x520 [ext4]
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff812163d7>] evict+0xa7/0x170
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81216cbb>] iput+0x18b/0x1f0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120b26e>] do_unlinkat+0x1ae/0x2b0
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff811ffa0e>] ? SYSC_newstat+0x3e/0x60
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff8120c276>] SyS_unlink+0x16/0x20
Apr 22 00:49:46 scgyshell-1 kernel: [<ffffffff81640e49>] system_call_fastpath+0x16/0x1b
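To check how often the hang occurs and which tasks are affected, the hung-task lines can be pulled out of the kernel log. This is a rough sketch, assuming the "INFO: task NAME:PID blocked for more than" message format shown in the backtrace above; the function name hung_tasks is hypothetical.

```shell
# Hypothetical helper: reads kernel log text on stdin and prints
# "NAME PID" for every hung-task report it finds.
hung_tasks() {
    sed -n 's/.*INFO: task \(.*\):\([0-9][0-9]*\) blocked for more than.*/\1 \2/p'
}
```

For example, `dmesg | hung_tasks | sort | uniq -c` would count the reports per task.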
PS: I use a Ceph RBD disk to hold the container's data. I am testing whether the problem can also be reproduced when the container is stored on a local disk.
>Host OS:
CentOS 7
>Guest OS:
Ubuntu 14.04
>Additional info (see https://openvz.org/Reporting_OpenVZ_problem):