From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Zi Yan, Michal Hocko
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt,
 Thomas Gleixner, x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki",
 Andrew Morton, Mike Rapoport, Anshuman Khandual, Dan Williams,
 Wei Yang, linux-ia64@vger.kernel.org, linux-kernel@vger.kernel.org,
 linuxppc-dev@lists.ozlabs.org, linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Mon, 14 Jun 2021 13:32:25 +0200
Message-ID: <640bd1da-4bcb-cfda-18c0-da0ddb90b661@redhat.com>
In-Reply-To: <289DA3C0-9AE5-4992-A35A-C13FCE4D8544@nvidia.com>

On 02.06.21 17:56, Zi Yan wrote:
> On 10 May 2021, at 10:36, Zi Yan wrote:
>
>> On 7 May 2021, at 10:00, David Hildenbrand wrote:
>>
>>> On 07.05.21 13:55, Michal Hocko wrote:
>>>> [I haven't read through the respective patches due to lack of time,
>>>> but let me comment on the general idea and the underlying
>>>> justification]
>>>>
>>>> On Thu 06-05-21 17:31:09, David Hildenbrand wrote:
>>>>> On 06.05.21 17:26, Zi Yan wrote:
>>>>>> From: Zi Yan
>>>>>>
>>>>>> Hi all,
>>>>>>
>>>>>> This patchset tries to remove the restriction on memory
>>>>>> hotplug/hotremove granularity, which is always greater than or
>>>>>> equal to the memory section size [1]. With the patchset, the kernel
>>>>>> is able to online/offline memory at a size independent of the
>>>>>> memory section size, as small as 2MB (the subsection size).
>>>>>
>>>>> ... which doesn't make any sense, as we can only online/offline
>>>>> whole memory block devices.
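
(For context: the granularity user space can actually online/offline is 
the memory block size exposed via sysfs, which is at least the section 
size. A minimal userspace sketch that reads it -- assuming a kernel with 
memory hotplug enabled; error handling kept simple:

	/* Print the memory block size -- the real online/offline
	 * granularity user space sees. */
	#include <stdio.h>

	int main(void)
	{
		FILE *f = fopen("/sys/devices/system/memory/block_size_bytes", "r");
		unsigned long long block_size;

		if (!f) {
			perror("fopen");	/* kernel without memory hotplug */
			return 1;
		}
		if (fscanf(f, "%llx", &block_size) != 1) {	/* value is in hex */
			fclose(f);
			return 1;
		}
		fclose(f);
		printf("memory block size: %llu MiB\n", block_size >> 20);
		/* Individual blocks are onlined/offlined through
		 * /sys/devices/system/memory/memoryN/state. */
		return 0;
	}

On x86-64 this typically reports 128 MiB or more, i.e. well above both 
the section size and the 2MB subsection size.)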
>>>>
>>>> Agreed. The subsection thingy is just a hack to work around pmem
>>>> alignment problems. For real memory hotplug it is quite hard to
>>>> argue for reasonable hotplug scenarios for very small physical
>>>> memory ranges wrt. the existing sparsemem memory model.
>>>>
>>>>>> The motivation is to increase MAX_ORDER of the buddy allocator and
>>>>>> pageblock size without increasing memory hotplug/hotremove
>>>>>> granularity at the same time,
>>>>>
>>>>> Gah, no. Please no. No.
>>>>
>>>> Agreed. Those are completely independent concepts. MAX_ORDER can be
>>>> really arbitrary irrespective of the section size with the vmemmap
>>>> sparse model. The existing restriction is due to the old sparse
>>>> model not being able to do page pointer arithmetic across memory
>>>> sections. Is there any reason to stick with that memory model for an
>>>> advanced feature you are working on?
>>
>> No. I just want to increase MAX_ORDER. If the existing restriction can
>> be removed, that will be great.
>>
>>>
>>> I gave it some more thought yesterday. I guess the first thing we
>>> should look into is increasing MAX_ORDER and leaving pageblock_order
>>> and section size as is -- finding out what we have to tweak to get
>>> that up and running. Once we have that in place, we can actually look
>>> into better fragmentation avoidance etc. One step at a time.
>>
>> It makes sense to me.
>>
>>>
>>> Because that change itself might require some thought. Requiring that
>>> a bigger MAX_ORDER depends on SPARSE_VMEMMAP is something reasonable
>>> to do.
>>
>> OK, if with SPARSE_VMEMMAP MAX_ORDER can be set bigger than
>> SECTION_SIZE, that is perfectly OK with me. 1GB THP support, which I
>> ultimately want to add, will require SPARSE_VMEMMAP too (otherwise,
>> every page++ would need to be changed to nth_page(page, 1)).
>>
>>>
>>> As stated somewhere here already, we'll have to look into making
>>> alloc_contig_range() (and its main users CMA and virtio-mem)
>>> independent of MAX_ORDER and mainly rely on pageblock_order. The
>>> current handling in alloc_contig_range() is far from optimal, as we
>>> have to isolate a whole MAX_ORDER - 1 page -- and on ZONE_NORMAL
>>> we'll fail easily if any part contains something unmovable, although
>>> we don't even want to allocate that part. I actually have that on my
>>> list (to be able to fully support pageblock_order instead of
>>> MAX_ORDER - 1 chunks in virtio-mem), however I didn't have time to
>>> look into it.
>>
>> So in your mind, for gigantic page allocation (> MAX_ORDER),
>> alloc_contig_range() should be used instead of the buddy allocator,
>> while pageblock_order is kept at a small granularity like 2MB. Is that
>> the case? Isn't it going to have a high failure rate when any of the
>> pageblocks within a gigantic page range (like 1GB) becomes unmovable?
>> Are you thinking of an additional mechanism/policy to prevent that
>> from happening, as an additional step for gigantic page allocation?
>> Like your ZONE_PREFER_MOVABLE idea?
>>
>>>
>>> Further, page onlining / offlining code and early init code most
>>> probably also need care if MAX_ORDER - 1 crosses sections. Memory
>>> holes we might suddenly have in MAX_ORDER - 1 pages might become a
>>> problem and will have to be handled. Not sure which other code has to
>>> be tweaked (compaction? page isolation?).
>>
>> Can you elaborate a little more? From what I understand, memory holes
>> mean valid PFNs are not contiguous before and after a hole, so pfn++
>> will not work, but struct pages are still virtually contiguous
>> assuming SPARSE_VMEMMAP, meaning page++ would still work. So when
>> MAX_ORDER - 1 crosses sections, additional code would be needed
>> instead of a simple pfn++. Is there anything I am missing?
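
(A toy userspace model of that struct page layout difference -- purely 
illustrative, not actual kernel code; the sizes and names below are made 
up:

	/* Why page++ works with SPARSE_VMEMMAP but not with classic
	 * sparsemem. */
	#include <stdio.h>
	#include <stdlib.h>

	struct page { unsigned long flags; };

	#define PAGES_PER_SECTION 4	/* tiny, for illustration only */
	#define NR_SECTIONS       3

	int main(void)
	{
		/* Classic sparsemem: one separately allocated mem_map per
		 * section. struct pages of adjacent sections are NOT
		 * virtually contiguous, so "page + n" is only valid within
		 * a single section; crossing a section boundary requires
		 * going through pfn arithmetic. */
		struct page *section_mem_map[NR_SECTIONS];
		int i;

		for (i = 0; i < NR_SECTIONS; i++)
			section_mem_map[i] = calloc(PAGES_PER_SECTION,
						    sizeof(struct page));

		/* SPARSE_VMEMMAP: one virtually contiguous array covering
		 * all sections, so plain pointer arithmetic stays valid
		 * across section boundaries -- which is what MAX_ORDER - 1
		 * pages spanning sections rely on. */
		struct page *vmemmap = calloc(NR_SECTIONS * PAGES_PER_SECTION,
					      sizeof(struct page));

		unsigned long pfn = PAGES_PER_SECTION - 1; /* last pfn of section 0 */
		struct page *p = &vmemmap[pfn];

		p++;	/* fine: first page of section 1 */
		printf("vmemmap: pfn %lu + 1 -> struct page index %ld\n",
		       pfn, (long)(p - vmemmap));

		/* With per-section maps, the same step has to recompute
		 * which map to use:
		 * p = &section_mem_map[(pfn + 1) / PAGES_PER_SECTION]
		 *                     [(pfn + 1) % PAGES_PER_SECTION]; */
		for (i = 0; i < NR_SECTIONS; i++)
			free(section_mem_map[i]);
		free(vmemmap);
		return 0;
	}

This is also essentially what nth_page() hides: it converts to a pfn and 
back instead of doing raw pointer arithmetic.)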
>>
>> BTW, to test a system with memory holes, do you know if there is an
>> easy way of adding random memory holes to an x86_64 VM, which could
>> help reveal potential missing pieces in the code? Changing the
>> BIOS-e820 table might be one way, but I have no idea how to do that
>> on QEMU.
>>
>>>
>>> Figuring out what needs care itself might take quite some effort.
>>>
>>> One thing I was thinking about as well: the bigger our MAX_ORDER, the
>>> slower it could be to allocate smaller pages. If we have 1G pages,
>>> splitting them down to 4k takes 8 additional steps if I'm not wrong
>>> (a 1G page is order 18, so 18 splits to reach 4k instead of the 10
>>> needed from order MAX_ORDER - 1 = 10 today). Of course, that's the
>>> worst case. Would be interesting to evaluate.
>>
>> Sure. I am planning to check it too. As a simple start, I am going to
>> run will-it-scale benchmarks to see if there is any performance
>> difference between different MAX_ORDERs.
>
> I ran vm-scalability and memory-related will-it-scale on a server with
> 256GB memory to see the impact of increasing MAX_ORDER and didn't see
> much difference for most of the workloads, like page_fault1,
> page_fault2, and page_fault3 from will-it-scale. But feel free to check
> the attached complete results and let me know what should be looked
> into. Thanks.

Right, for will-it-scale it looks like there are mostly minor 
differences, although I am not sure if the results are really stable 
(ranging from -6% to +6%). For vm-scalability the numbers seem to vary 
even more (e.g., a stddev of ±63%), so I have no idea how meaningful 
they are. But I guess for these benchmarks, the net change won't really 
be significant. Thanks!

-- 
Thanks,

David / dhildenb