Subject: Re: [PATCH RFC 0/4] mm: place pages to the freelist tail when onling and undoing isolation
From: Vlastimil Babka
To: David Hildenbrand, osalvador@suse.de
Cc: linux-kernel@vger.kernel.org, linux-mm@kvack.org, linux-hyperv@vger.kernel.org, xen-devel@lists.xenproject.org, linux-acpi@vger.kernel.org, Andrew Morton, Alexander Duyck, Dave Hansen, Haiyang Zhang, "K. Y. Srinivasan", Mel Gorman, Michael Ellerman, Michal Hocko, Mike Rapoport, Scott Cheloha, Stephen Hemminger, Wei Liu, Wei Yang
Date: Wed, 23 Sep 2020 16:31:25 +0200
Message-ID: <67928cbd-950a-3279-bf9b-29b04c87728b@suse.cz>

On 9/16/20 9:31 PM, David Hildenbrand wrote:
>
>> On 16.09.2020 at 20:50, osalvador@suse.de wrote:
>>
>> On 2020-09-16 20:34, David Hildenbrand wrote:
>>> When adding separate memory blocks via add_memory*() and onlining them
>>> immediately, the metadata (especially the memmap) of the next block will be
>>> placed onto one of the just added+onlined blocks. This creates a chain
>>> of unmovable allocations: if the last memory block cannot get
>>> offlined+removed(), neither can any of the dependent ones.
>>> As a result, we directly have unmovable allocations all over the place.
>>>
>>> This can be observed quite easily using virtio-mem, but it can also
>>> be observed when using DIMMs. The freshly onlined pages will usually be
>>> placed at the head of the freelists, meaning they will be allocated next,
>>> usually turning the just-added memory immediately unremovable. The
>>> fresh pages are cold; preferring to allocate others (that might be hot)
>>> also seems to be the natural thing to do.
>>>
>>> It also applies to the hyper-v balloon, xen-balloon, and ppc64 dlpar: when
>>> adding separate, successive memory blocks, each memory block will have
>>> unmovable allocations on it, so, for example, gigantic pages will fail to
>>> allocate.
>>>
>>> While ZONE_NORMAL doesn't provide any guarantees that memory can get
>>> offlined+removed again (any kind of fragmentation with unmovable
>>> allocations is possible), there are many scenarios (hotplugging a lot of
>>> memory, running a workload, hotunplugging some memory/as much as possible)
>>> where we can offline+remove quite a lot with this patchset.
>>
>> Hi David,
>>
>
> Hi Oscar.
>
>> I did not read through the patchset yet, so sorry if the question is nonsense, but is this not trying to fix the same issue the vmemmap patches did? [1]
>
> Not nonsense at all. It only helps to some degree, though. It solves the dependencies due to the memmap. However, it's not completely ideal, especially for single memory blocks.
>
> With single memory blocks (virtio-mem, xen-balloon, hv balloon, ppc dlpar) you still have unmovable allocations (vmemmap chunks) all over the physical address space. Consider the gigantic page example after hotplug: you directly fragmented all hotplugged memory.
>
> Of course, there might be (less extreme) dependencies due to page tables for the identity mapping, extended struct pages and similar.
>
> That said, there are other benefits when preferring other memory over just-hotplugged memory. Think about adding+onlining memory during boot (DIMMs under QEMU, virtio-mem): once the system is up, you will have most (all) of that memory completely untouched.
>
> So while vmemmap on hotplugged memory would tackle some part of the issue, there are cases where this approach is better, and there are even benefits when combining both.

I see the point, but I don't think the head/tail mechanism is great for
this. It might sort of work, but with other interfering activity there are
no guarantees, and it relies on a subtle implementation detail. There are
better mechanisms possible, I think, such as preparing a larger
MIGRATE_UNMOVABLE area in the existing memory before we allocate those
long-term management structures. Or onlining a bunch of blocks as
ZONE_MOVABLE first, and only later converting them to ZONE_NORMAL in a
controlled way when the existing normal zone becomes depleted?

I guess it's an issue that onlining individual blocks (e.g. 128M each) is
so disconnected from one another that it's hard to employ a strategy that
works best for, e.g., a whole bunch of GB onlined at once. But I noticed
some effort towards a new API, so maybe that will be solved there too?

> Thanks!
>
> David
>
>>
>> I was about to give it a new respin now that the hwpoison stuff has been settled.
>>
>> [1] https://patchwork.kernel.org/cover/11059175/
>>
>
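A toy model (not kernel code and not part of the patch set) may help illustrate the memmap chain described in the cover letter quoted above. It assumes a single free list, exactly one memmap page per hotplugged block, and an allocator that always takes the first free page; the names `memmap_hosts` and `removable_blocks` are hypothetical. Onlining to the head of the list models the behaviour the cover letter describes, and onlining to the tail models what the patch set proposes.

```python
from collections import deque

# Toy model (NOT kernel code): a single free list whose entries are the ids of
# the memory blocks that free pages belong to. Block 0 is boot memory; the
# left end of the deque is the head, where the allocator takes pages from.
BOOT_PAGES = 8
BLOCK_PAGES = 4


def memmap_hosts(nr_blocks: int, online_to_tail: bool) -> list[int]:
    """Hotplug blocks 1..nr_blocks one by one; return, for each block, the
    id of the block whose page ends up holding its memmap."""
    freelist = deque([0] * BOOT_PAGES)
    hosts = []
    for block in range(1, nr_blocks + 1):
        # Adding the block: its memmap is a long-term unmovable allocation
        # (assumed here to need exactly one page) taken from the list head.
        hosts.append(freelist.popleft())
        # Onlining the block: hand its pages to the allocator.
        for _ in range(BLOCK_PAGES):
            if online_to_tail:
                freelist.append(block)      # handed out last
            else:
                freelist.appendleft(block)  # handed out next
    return hosts


def removable_blocks(hosts: list[int], pinned: int) -> set[int]:
    """Return the hotplugged blocks that can be offlined+removed when block
    `pinned` cannot be, assuming a block cannot go while it still holds the
    memmap of another block that is present."""
    remaining = set(range(1, len(hosts) + 1))
    removed: set[int] = set()
    progress = True
    while progress:
        progress = False
        for block in sorted(remaining):
            if block == pinned:
                continue
            # hosts[b - 1] is the block that holds block b's memmap.
            if any(hosts[b - 1] == block for b in remaining):
                continue
            remaining.remove(block)
            removed.add(block)
            progress = True
    return removed


if __name__ == "__main__":
    head = memmap_hosts(4, online_to_tail=False)
    tail = memmap_hosts(4, online_to_tail=True)
    print(head, removable_blocks(head, pinned=4))  # [0, 1, 2, 3] set()
    print(tail, removable_blocks(tail, pinned=4))  # [0, 0, 0, 0] {1, 2, 3}
```

With head placement, block k+1's memmap lands on block k, so pinning the last block pins every block before it; with tail placement, the memmaps come from boot memory in this model. In a real system other allocations interleave with these, so tail placement makes this outcome more likely but does not guarantee it, which is the concern raised in the reply above.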