Date: Fri, 10 Jan 2020 07:31:47 +0100
From: Michal Hocko
To: Pavel Machek
Cc: kernel list, Andrew Morton, linux-mm@kvack.org, akpm@linux-foundation.org
Subject: Re: OOM killer not nearly agressive enough?
Message-ID: <20200110063147.GB29802@dhcp22.suse.cz>
In-Reply-To: <20200109224845.GA1220@amd>

On Thu 09-01-20 23:48:45, Pavel Machek wrote:
> Hi!
>
> > > > > Do we agree that OOM killer should have reacted way sooner?
> > > >
> > > > This is impossible to answer without knowing what was going on at the
> > > > time. Was the system thrashing over page cache/swap? In other words, is
> > > > the system completely out of memory or refaulting the working set all
> > > > the time because it doesn't fit into memory?
> > >
> > > Swap was full, so "completely out of memory", I guess. Chromium does
> > > that fairly often :-(.
> >
> > The oom heuristic is based on the reclaim failure. If the reclaim makes
> > some progress then the oom killer is not hit. Have a look at
> > should_reclaim_retry for more details.
>
> Thanks for pointer.
>
> I guess setting MAX_RECLAIM_RETRIES to 1 is not something you'd
> recommend? :-).

You can certainly play with that. I am not overly optimistic that it
would help, though, because a symptom of a thrashing system is that we
do not even reach this point. Pages are simply recycled, but reclaiming
them evicts other parts of the hot working set. But I am only guessing
what the problem is in your case.

Anyway, lowering MAX_RECLAIM_RETRIES would tend to make the OOM decision
more timing sensitive in general. If reclaim cannot make progress
because of IO latencies or other resource depletion, then OOM would be
declared too early. The current MAX_RECLAIM_RETRIES is not something we
have tuned in any sense.
I remember that changing it did not make much difference unless the
number was really high, which would be a sign that reclaim is not
throttled very well.

> > > PSI is completely different system, but I guess
> > > I should attempt to tweak the existing one first...
> >
> > PSI is measuring the cost of the allocation (among other things) and
> > that can give you some idea on how much time is spent to get memory.
> > Userspace can implement a policy based on that and act. The kernel oom
> > killer is the last resort when there is really no memory to
> > allocate.
>
> So what I'm seeing is system that is unresponsive, easily for an hour.
>
> Sometimes, I'm able to log in. When I could do that, system was
> absurdly slow, like ps printing at more than 10 seconds per line.
> ps on my system takes 300msec, estimate in the slow case would be 2000
> seconds, that is slowdown by factor of 6000x. That would be X terminal
> opening in like two hours... that's not really usable.

It would be great to find out what the bottleneck is. Is the allocator
stuck in memory reclaim? Waiting on some lock? Reclaiming pages which
are then stolen by other contending processes?
--
Michal Hocko
SUSE Labs
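
The retry accounting discussed in this message can be illustrated with a
toy model. This is only a sketch under stated assumptions, not the
kernel's should_reclaim_retry: the names RetryState, should_retry and
rounds_until_oom are hypothetical, the default of 16 is assumed to match
the kernel's MAX_RECLAIM_RETRIES define, and watermark checks, costly
allocation orders and throttling are ignored. It shows why a thrashing
workload that reclaims a few pages every so often never exhausts the
retry budget, and why a budget of 1 fires on a short stall.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

# Assumption: mirrors the kernel define; treat the value as illustrative.
MAX_RECLAIM_RETRIES = 16


@dataclass
class RetryState:
    """Hypothetical holder for the count of consecutive no-progress rounds."""
    no_progress_loops: int = 0


def should_retry(state: RetryState, reclaimed_pages: int,
                 max_retries: int = MAX_RECLAIM_RETRIES) -> bool:
    """Toy retry decision: any progress resets the counter; give up only
    after more than max_retries consecutive rounds reclaim nothing."""
    if reclaimed_pages > 0:
        state.no_progress_loops = 0
    else:
        state.no_progress_loops += 1
    return state.no_progress_loops <= max_retries


def rounds_until_oom(reclaim_results: Iterable[int],
                     max_retries: int = MAX_RECLAIM_RETRIES) -> Optional[int]:
    """Return the 1-based round at which OOM would be declared in this
    model, or None if the retry budget is never exhausted."""
    state = RetryState()
    for round_no, reclaimed in enumerate(reclaim_results, start=1):
        if not should_retry(state, reclaimed, max_retries):
            return round_no
    return None


if __name__ == "__main__":
    # Reclaim never frees anything: the budget runs out.
    print(rounds_until_oom([0] * 100))
    # Thrashing: some pages are recycled every fourth round, so the
    # counter keeps resetting and OOM is never declared.
    print(rounds_until_oom([0, 0, 0, 1] * 250))
    # Budget of 1: a stall of two rounds is enough to declare OOM.
    print(rounds_until_oom([0, 0, 1] * 10, max_retries=1))
```

In this model the thrashing sequence never reaches the OOM path with
either budget unless two consecutive rounds fail, which matches the
point that a thrashing system may not get to the retry limit at all.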