Re: [patch 0/7] improve memcg oom killer robustness v2


From: "azurIt" <azurit@pobox.sk>
To: "Michal Hocko" <mhocko@suse.cz>
Cc: "Johannes Weiner" <hannes@cmpxchg.org>,
	"Andrew Morton" <akpm@linux-foundation.org>,
	"David Rientjes" <rientjes@google.com>,
	"KAMEZAWA Hiroyuki" <kamezawa.hiroyu@jp.fujitsu.com>,
	"KOSAKI Motohiro" <kosaki.motohiro@jp.fujitsu.com>,
	linux-mm@kvack.org, cgroups@vger.kernel.org, x86@kernel.org,
	linux-arch@vger.kernel.org, linux-kernel@vger.kernel.org
Subject: Re: [patch 0/7] improve memcg oom killer robustness v2
Date: Mon, 16 Sep 2013 16:01:19 +0200
Message-ID: <20130916160119.2E76C2A1@pobox.sk>
In-Reply-To: <20130916134014.GA3674@dhcp22.suse.cz>

>On Sat 14-09-13 12:48:31, azurIt wrote:
>[...]
>> Here is the first occurrence, during the night between 5:15 and 5:25:
>>  - this time I kept a terminal open from another server to the problematic one, with htop running
>>  - when the server went down I looked at it and saw one process of one user at the top, taking 97% of CPU (cgroup 1304)
>
>I guess you do not have a stack trace(s) for that process? That would be
>extremely helpful.



I'm afraid that won't be possible, as the server is completely unresponsive when it happens. Anyway, I don't think it was the fault of a single process or a single user.




>>  - everything was stuck, so htop didn't help me much
>>  - luckily, my new 'load check' script, which I mentioned before, was able to kill apache and everything went back to normal (success with its very first version, wow ;) )
>>  - I checked some other logs and everything seems to point to cgroup 1304; the kernel log at 5:14-15 also shows a hard OOM in that cgroup:
>> http://watchdog.sk/lkml/kern7.log
>
>I am not sure what you mean by hard OOM because there is no global OOM
>in that log:
>$ grep "Kill process" kern7.log | sed 's@.*]\(.*Kill process\>\).*@\1@' | sort -u
> Memory cgroup out of memory: Kill process
>
>But you had a lot of memcg OOMs in that group (1304) during that time
>(and even earlier):



I meant an OOM inside cgroup 1304. I'm sure this cgroup caused the problem.




>$ grep "\<1304\>" kern7.log 
>Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:46 server01 kernel: [188287.871427] [30433]  1304 30433   181781    66426   7       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871594] [30808]  1304 30808   169111    53866   4       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871742] [30809]  1304 30809   181168    65992   2       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.871890] [30811]  1304 30811   168684    53399   3       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.872041] [30814]  1304 30814   181102    65924   3       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.872189] [30815]  1304 30815   168814    53451   4       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30808]  1304 30808   169111    53918   3       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30809]  1304 30809   181168    65992   2       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30811]  1304 30811   168684    53399   3       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30814]  1304 30814   181102    65924   3       0             0 apache2
>Sep 14 05:03:46 server01 kernel: [188287.973155] [30815]  1304 30815   168815    53558   0       0             0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:03:47 server01 kernel: [188289.231873] [30809]  1304 30809   182662    67534   7       0             0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232021] [30811]  1304 30811   171920    56781   4       0             0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232171] [30814]  1304 30814   182596    67470   3       0             0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232319] [30815]  1304 30815   171920    56778   1       0             0 apache2
>Sep 14 05:03:47 server01 kernel: [188289.232478] [30896]  1304 30896   171918    56761   0       0             0 apache2
>[...]
>Sep 14 05:14:00 server01 kernel: [188902.666893] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:00 server01 kernel: [188902.742928] [ 7806]  1304  7806   178891    64008   6       0             0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743080] [ 7910]  1304  7910   175318    60302   2       0             0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743228] [ 7911]  1304  7911   174943    59878   1       0             0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743376] [ 7912]  1304  7912   171568    56404   3       0             0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743524] [ 7914]  1304  7914   174911    59879   5       0             0 apache2
>Sep 14 05:14:00 server01 kernel: [188902.743673] [ 7915]  1304  7915   173472    58386   2       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.249749] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7910]  1304  7910   176278    61211   6       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7911]  1304  7911   176278    61211   7       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7912]  1304  7912   173732    58655   3       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7914]  1304  7914   176269    61211   7       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7915]  1304  7915   176269    61211   7       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.336276] [ 7966]  1304  7966   170385    55164   7       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.340992] Task in /1304/uid killed as a result of limit of /1304
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7911]  1304  7911   176340    61332   2       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7912]  1304  7912   173996    58901   1       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7914]  1304  7914   176331    61331   4       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7915]  1304  7915   176331    61331   2       0             0 apache2
>Sep 14 05:14:02 server01 kernel: [188904.424284] [ 7966]  1304  7966   170385    55164   7       0             0 apache2
>[...]
>
>The only thing that is clear from this is that there is always one
>process killed and a new one is spawned and that leads to the same
>out of memory situation. So this is precisely what Johannes already
>described as a Hydra load.
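
As a rough illustration of the kill-and-respawn pattern described above, the memcg OOM events in a log like kern7.log could be tallied per cgroup. This is only a sketch and not something from the thread: the function name `count_memcg_kills` and the sample list are hypothetical, and the regular expression assumes the "Task in ... killed as a result of limit of ..." line format shown in the excerpt.

```python
import re
from collections import Counter

# Header line printed for each memcg OOM in the quoted log, e.g.
# "... Task in /1304/uid killed as a result of limit of /1304"
KILL_RE = re.compile(r"Task in (\S+) killed as a result of limit of (\S+)")


def count_memcg_kills(lines):
    """Count OOM kill headers, keyed by the cgroup whose limit was hit."""
    kills = Counter()
    for line in lines:
        match = KILL_RE.search(line)
        if match:
            kills[match.group(2)] += 1
    return kills


# Three header lines and one per-task line copied from the excerpt above.
sample_log = [
    "Sep 14 05:03:45 server01 kernel: [188287.778020] Task in /1304/uid killed as a result of limit of /1304",
    "Sep 14 05:03:46 server01 kernel: [188287.871427] [30433]  1304 30433   181781    66426   7       0             0 apache2",
    "Sep 14 05:03:46 server01 kernel: [188287.877731] Task in /1304/uid killed as a result of limit of /1304",
    "Sep 14 05:03:47 server01 kernel: [188289.137540] Task in /1304/uid killed as a result of limit of /1304",
]

print(count_memcg_kills(sample_log))  # prints Counter({'/1304': 3})
```

Run over the full log (for example by passing an open file object instead of the list), a count in the hundreds for a single cgroup within a few minutes would be consistent with the Hydra pattern.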



I can't do anything about this; the processes are serving visitors to that user's web sites.




>There is a silence in the logs:
>Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB
>Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362] uid/euid:1387/1387 gid/egid:100/100, parent /usr/sbin/cron[cron:10144] uid/euid:0/0 gid/egid:0/0
>
>Maybe that is what you are referring to as a stuck situation. Is pid
>8453 the task you have seen consuming the CPU? If yes, then we would
>need a stack for that task to find out what is going on.
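
A silence like the one above can also be located mechanically by comparing the kernel uptime stamps of consecutive log lines. The following is a minimal sketch under stated assumptions, not a tool used in the thread: `find_gaps` and the 60-second threshold are hypothetical, and the pattern assumes the "kernel: [seconds.micros]" prefix seen in the excerpt.

```python
import re

# Kernel uptime stamp as printed in the quoted log, e.g. "kernel: [188940.869639]"
STAMP_RE = re.compile(r"kernel: \[\s*(\d+\.\d+)\]")


def find_gaps(lines, min_gap=60.0):
    """Return (previous_stamp, next_stamp, seconds) for each pair of
    consecutive stamped lines that are more than min_gap seconds apart."""
    gaps = []
    prev = None
    for line in lines:
        match = STAMP_RE.search(line)
        if not match:
            continue  # skip lines without a kernel uptime stamp
        now = float(match.group(1))
        if prev is not None and now - prev > min_gap:
            gaps.append((prev, now, now - prev))
        prev = now
    return gaps


# The two lines that bracket the silence, shortened from the excerpt above.
gap_sample = [
    "Sep 14 05:14:39 server01 kernel: [188940.869639] Killed process 8453 (apache2) total-vm:710732kB, anon-rss:245680kB, file-rss:4588kB",
    "Sep 14 05:21:24 server01 kernel: [189344.518699] grsec: From 95.103.217.66: failed fork with errno EAGAIN by /bin/dash[sh:10362]",
]

for start, end, seconds in find_gaps(gap_sample):
    print(f"no kernel messages for {seconds:.0f}s (uptime {start} to {end})")
```

For the two lines shown, the gap is roughly 404 seconds of uptime, which matches the 05:14:39 to 05:21:24 window in the wall-clock stamps.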




Unfortunately I don't know the PID, but I don't think it's important. I just wanted to say that cgroup 1304 was causing the problem in this particular case (several signs pointed to it). As you can see in the logs, too many memcg OOMs create huge I/O, which takes down the whole server for no reason.

The same thing happens several times per day *if* I'm running a kernel with Johannes' latest patch.

azur


