Date: Tue, 23 Jul 2024 12:49:41 +0300
From: "Kirill A. Shutemov"
To: Vlastimil Babka
Cc: Michal Hocko, Andrew Morton, "Borislav Petkov (AMD)", Mel Gorman,
	Tom Lendacky, Mike Rapoport, linux-mm@kvack.org,
	linux-kernel@vger.kernel.org, Jianxiong Gao, stable@vger.kernel.org
Subject: Re: [PATCH] mm: Fix endless reclaim on machines with unaccepted memory.
In-Reply-To: <564ff8e4-42c9-4a00-8799-eaa1bef9c338@suse.cz>
References: <20240716130013.1997325-1-kirill.shutemov@linux.intel.com>
 <564ff8e4-42c9-4a00-8799-eaa1bef9c338@suse.cz>

On Tue, Jul 23, 2024 at 09:30:27AM +0200, Vlastimil Babka wrote:
> On 7/22/24 4:07 PM, Kirill A. Shutemov wrote:
> > On Wed, Jul 17, 2024 at 02:06:46PM +0200, Michal Hocko wrote:
> >> Please try to investigate this further. The patch as is looks rather
> >> questionable to me TBH. Spilling unaccepted memory into the reclaim
> >> seems like something we should avoid if possible as this is something
> >
> > Okay, I believe I have a better understanding of the situation:
> >
> > - __alloc_pages_bulk() takes pages from the free list without accepting
> >   more memory. This can cause the number of free pages to fall below the
> >   watermark.
> >
> >   This issue can be resolved by accepting more memory in
> >   __alloc_pages_bulk() if the watermark check fails.
> >
> >   The problem is not only related to unaccepted memory. I think the
> >   deferred page initialization mechanism could result in a premature OOM
> >   if __alloc_pages_bulk() allocates pages faster than deferred page
> >   initialization can add them to the free lists. However, this scenario
> >   is unlikely.
> >
> > - There is nothing that compels the kernel to accept more memory after
> >   the watermarks have been calculated in __setup_per_zone_wmarks(). This
> >   can put us under the watermark.
> >
> >   This issue can be resolved by accepting memory up to the watermark
> >   after the watermarks have been initialized.
> >
> > - Once kswapd is started, it will begin spinning if we are below the
> >   watermark and there is no memory that can be reclaimed. Once the above
> >   problems are fixed, this issue will be resolved as well.
> >
> > - The kernel needs to accept memory up to the PROMO watermark. This will
> >   prevent unaccepted memory from interfering with NUMA balancing.
> 
> So do we still assume all memory is eventually accepted and it's just an
> initialization phase thing? And the only reason we don't do everything in
> a kthread like the deferred struct page init is to spread out some
> potential contention on the host side?
> 
> If yes, do we even need NUMA balancing to be active during that phase?

No, there is nothing that requires guests to accept all of the memory. If
the working set of a workload within the guest only requires a portion of
the memory, the rest will remain unaccepted and available to the host for
other tasks.

I think accepting memory up to the PROMO watermark would not hurt anybody.
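For illustration, the direction I have in mind is roughly the sketch below.
It is untested and accept_memory_for_watermarks() is a made-up name; it
assumes it can reuse the acceptance helper that get_page_from_freelist()
already uses in mm/page_alloc.c (try_to_accept_memory()):

/*
 * Sketch only, not a real patch: once the watermarks have been computed,
 * walk the zones and accept enough memory to get each zone above its high
 * watermark, so that kswapd (and later NUMA balancing via the PROMO
 * watermark) never has to chase free pages that only exist as unaccepted
 * memory.
 */
static void accept_memory_for_watermarks(void)
{
	struct zone *zone;

	for_each_populated_zone(zone) {
		unsigned long target = high_wmark_pages(zone);

		while (zone_page_state(zone, NR_FREE_PAGES) < target) {
			/* Assumed helper; accepts a chunk into the free lists */
			if (!try_to_accept_memory(zone, 0))
				break;	/* nothing left to accept in this zone */
		}
	}
}

Calling something like this right after __setup_per_zone_wmarks() would
close the window where freshly computed watermarks are not backed by
accepted memory.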
> > The patch below addresses the issues I listed earlier. It is not yet
> > ready for application. Please see the issues listed below.
> >
> > Andrew, please drop the current patch.
> >
> > There are a few more things I am worried about:
> >
> > - The current get_page_from_freelist() and patched __alloc_pages_bulk()
> >   only try to accept memory if the requested (alloc_flags &
> >   ALLOC_WMARK_MASK) watermark check fails. For example, if an allocation
> >   is requested with ALLOC_WMARK_MIN, we will not try to accept more
> >   memory, which could potentially put us under the high/promo watermark
> >   and cause the following kswapd start to get us into an endless loop.
> >
> >   Do we want to make memory acceptance in these paths independent of
> >   alloc_flags?
> 
> Hm, ALLOC_WMARK_MIN will proceed, but an allocation below the low
> watermark will still wake up kswapd, right? Isn't that another scenario
> where kswapd can start spinning?

Yes, that is the concern.

> > - __isolate_free_page() removes a page from the free list without
> >   accepting new memory. The function is called with the zone lock taken.
> >   It is a bad idea to accept memory while holding the zone lock, but the
> >   alternative of pushing the accept out to the caller is not much better.
> >
> >   I have not observed any issues caused by __isolate_free_page() in
> >   practice, but there is no reason why it couldn't potentially cause
> >   problems.
> >
> > - The function take_pages_off_buddy() also removes pages from the free
> >   list without accepting new memory. Unlike __isolate_free_page(), it is
> >   called without the zone lock being held, so we can accept memory
> >   there. I believe we should do so.
> >
> > I understand why adding unaccepted memory handling into the reclaim path
> > is questionable. However, it may be the best way to handle cases like
> > __isolate_free_page() and possibly others in the future that directly
> > take memory from free lists.
> 
> Yes, it seems it might not be that bad a solution; otherwise it could be a
> hopeless game of whack-a-mole trying to prevent all the corner cases where
> reclaim can be triggered without accepting memory first.
> 
> Although just removing the lazy accept mode would be a much more appealing
> solution than this :) :P

Not really an option for big VMs. It might add many minutes to boot time.
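Coming back to the alloc_flags question above, decoupling could look roughly
like the sketch below. Again untested, the helper name is made up, and it
assumes the has_unaccepted_memory()/try_to_accept_memory() helpers that
already live in mm/page_alloc.c:

/*
 * Sketch only: accept memory against the high watermark regardless of
 * which watermark (ALLOC_WMARK_MIN/LOW/HIGH) the caller is allocating
 * against. An allocation that is allowed to dip below the low watermark
 * then still tops the zone up first, so the kswapd wakeup that follows
 * does not find a zone it can never push back above the watermark by
 * reclaim alone.
 */
static void accept_memory_for_alloc(struct zone *zone, unsigned int order,
				    int highest_zoneidx)
{
	if (!has_unaccepted_memory())	/* assumed existing fast check */
		return;

	if (!zone_watermark_ok(zone, order, high_wmark_pages(zone),
			       highest_zoneidx, 0))
		try_to_accept_memory(zone, order);
}

get_page_from_freelist() and the patched __alloc_pages_bulk() would call
this just before their zone_watermark_fast() checks.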
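And for the reclaim-path fallback, the catch-all could be as small as the
sketch below (untested, name made up), called before we conclude that no
progress can be made:

/*
 * Sketch only: before the allocator decides that reclaim cannot make
 * progress (and before the OOM killer is considered), try to accept
 * memory in the zones of the current allocation context. A zone that is
 * under its watermark purely because memory has not been accepted yet is
 * then fixed up here rather than by looping in reclaim.
 */
static bool accept_memory_before_oom(struct alloc_context *ac,
				     unsigned int order)
{
	struct zoneref *z;
	struct zone *zone;
	bool accepted = false;

	for_each_zone_zonelist(zone, z, ac->zonelist, ac->highest_zoneidx) {
		if (try_to_accept_memory(zone, order))	/* assumed helper */
			accepted = true;
	}

	/* The caller retries the allocation instead of declaring OOM. */
	return accepted;
}

Whether this would live in should_reclaim_retry() or somewhere else in the
slow path is secondary; the point is to have one place that covers
__isolate_free_page() and whatever else takes pages off the free lists
directly.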
> > Any thoughts?
> 
> Wonder if deferred struct page init has many of the same problems, i.e.
> with __isolate_free_page() and take_pages_off_buddy(), and if not, why?

Even if deferred struct page init did trigger reclaim, kswapd would not
spin forever. The background thread keeps adding more free memory, so
forward progress is guaranteed.

And deferred struct page init is done before init starts.

-- 
  Kiryl Shutsemau / Kirill A. Shutemov