Details
- Type: Bug
- Status: Resolved
- Priority: Major
- Resolution: Fixed
- Fix Version/s: Vz7.0-RTM
- Component/s: Containers::Kernel
- Security Level: Public
Description
Description of problem:
We are hitting an issue in the mem_cgroup_uncharge_page() function;
it appears quite often on our clients' servers.
The issue sometimes leads to a hard lockup and sometimes to a GP fault.
Based on bug reports from clients, the problem shows up when a user
process calls the "execve" or "exit" syscalls.
As we know, in those cases the kernel invokes "uncharging" for every page
when it is unmapped from all the mm's.
Kernel dump analysis shows that at the moment of
mem_cgroup_uncharge_page() the "memcg" pointer
(taken from page_cgroup) seems to point to some random memory area.
On the other hand, if we look at current->mm->mm_ub->css, the memcg instance
exists and is "online".
This led me to think that "page_cgroup->memcg" may be changed in parallel by
some other part of the memcg code.
As far as I understand, the only candidate here is the "reclaim code path"
(maybe I'm wrong).
So I suppose there might be a race between the "memcg uncharge code" and
the "memcg reclaim code".
Actual results:
Kernel panic/lockup
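The suspected interleaving can be illustrated with a small user-space model. This is only a sketch of the hypothesis in the description, not kernel code: the structures memcg_model and page_cgroup_model and the functions uncharge_lookup(), concurrent_update(), uncharge_commit() and simulate_race() are hypothetical names invented for illustration, and the steps are run one after another in a single thread instead of truly in parallel. In the model the stale pointer still refers to valid memory, so it shows only the stale snapshot and the broken accounting, not the lockup or GP fault seen in the dumps.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-ins for the kernel structures. */
struct memcg_model {
    long charged_pages;          /* pages currently charged to this group */
};

struct page_cgroup_model {
    struct memcg_model *memcg;   /* group the page is charged to */
    int used;                    /* nonzero while the charge is valid */
};

/* Uncharge path, step 1: take a snapshot of the owner pointer. */
static struct memcg_model *uncharge_lookup(struct page_cgroup_model *pc)
{
    return pc->used ? pc->memcg : NULL;
}

/* Assumed concurrent path: re-points the page to another group. */
static void concurrent_update(struct page_cgroup_model *pc,
                              struct memcg_model *to)
{
    pc->memcg->charged_pages--;
    to->charged_pages++;
    pc->memcg = to;
}

/* Uncharge path, step 2: drop the charge from the snapshot owner. */
static void uncharge_commit(struct page_cgroup_model *pc,
                            struct memcg_model *snap)
{
    assert(pc->used);            /* a page is uncharged only once */
    snap->charged_pages--;
    pc->used = 0;
}

/*
 * Runs the uncharge path, optionally letting the concurrent update
 * slip in between lookup and commit. Returns 1 if the snapshot went
 * stale or the counters did not both end at zero, 0 otherwise.
 */
int simulate_race(int interleave)
{
    struct memcg_model a = { 1 }, b = { 0 };
    struct page_cgroup_model pc = { &a, 1 };

    struct memcg_model *snap = uncharge_lookup(&pc);
    if (interleave)
        concurrent_update(&pc, &b);

    int stale = (snap != pc.memcg);
    uncharge_commit(&pc, snap);

    return stale || a.charged_pages != 0 || b.charged_pages != 0;
}
```

Calling simulate_race(0) models the uncharge path running alone, and simulate_race(1) models the other path changing the owner between the lookup and the commit. In the second case the charge is dropped from a group that no longer owns the page, which is the kind of inconsistency the report suspects.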
Issue Links
- is duplicated by: OVZ-6765 Kernel panic at res_counter_uncharge_until (Resolved)