From: David Hildenbrand <david@redhat.com>
Organization: Red Hat
To: Zi Yan
Cc: Oscar Salvador, Michael Ellerman, Benjamin Herrenschmidt,
 Thomas Gleixner, x86@kernel.org, Andy Lutomirski, "Rafael J. Wysocki",
 Andrew Morton, Mike Rapoport, Anshuman Khandual, Michal Hocko,
 Dan Williams, Wei Yang, linux-ia64@vger.kernel.org,
 linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org,
 linux-mm@kvack.org
Subject: Re: [RFC PATCH 0/7] Memory hotplug/hotremove at subsection size
Date: Thu, 6 May 2021 21:10:52 +0200
Message-ID: <0e850dcb-c69a-188b-7ab9-09e6644af3ab@redhat.com>
In-Reply-To: <3A6D54CF-76F4-4401-A434-84BEB813A65A@nvidia.com>
References: <20210506152623.178731-1-zi.yan@sent.com>
 <9D7FD316-988E-4B11-AC1C-64FF790BA79E@nvidia.com>
 <3a51f564-f3d1-c21f-93b5-1b91639523ec@redhat.com>
 <16962E62-7D1E-4E06-B832-EC91F54CC359@nvidia.com>
 <3A6D54CF-76F4-4401-A434-84BEB813A65A@nvidia.com>
>>
>> 1. Pageblock size
>>
>> There are a couple of features that rely on the pageblock size being
>> reasonably small to work as expected. One example is virtio-balloon
>> free page reporting, then there is virtio-mem (still also glued to
>> MAX_ORDER), and we have CMA (still also glued to MAX_ORDER). Most
>> probably there are more. We track movability/page isolation per
>> pageblock; it's the smallest granularity at which you can effectively
>> isolate pages or mark them as CMA (MIGRATE_ISOLATE, MIGRATE_CMA).
>> Well, and there are "ordinary" THP / huge pages most of our
>> applications use and will use, especially on smallish systems.
>>
>> Assume you bump up the pageblock order to 1G. Small VMs won't be able
>> to report any free pages to the hypervisor. You'll take the
>> "fine-grained" out of virtio-mem. Each CMA area will have to be at
>> least 1G big, which makes CMA essentially useless on smallish systems
>> (like we have on arm64 with 64k base pages -- pageblock_size is 512MB
>> and I hate it).
> 
> I understand the issue of having large pageblocks in small systems. My
> plan for this issue is to make MAX_ORDER a variable (pageblock size
> would be set according to MAX_ORDER) that can be adjusted based on
> total memory and via a boot time parameter. My apologies, since I did
> not state this clearly in my cover letter and it confused you. When we
> have a boot-time adjustable MAX_ORDER, a large pageblock like 1GB
> would only appear on systems with large memory.
> For small VMs, pageblock size would stay at 2MB, so all your concerns
> on smallish systems should go away.

I have to admit that I am not really a friend of that. I still think our
target goal should be to have gigantic THP *in addition to* ordinary
THP: use gigantic THP where enabled and possible, and just use ordinary
THP everywhere else. Having a single pageblock granularity is a real
limitation IMHO and requires us to hack the system to support it to some
degree.

> 
>> Then, imagine systems that have like 4G of main memory. By stopping
>> grouping at 2M and instead grouping at 1G, you can very easily find
>> yourself in a system where all your 4 pageblocks are unmovable and you
>> essentially don't optimize for huge pages in that environment any
>> more.
>>
>> Long story short: we need a different mechanism on top and shall leave
>> the pageblock size untouched; it's too tightly integrated with page
>> isolation, ordinary THP, and CMA.
> 
> I think it is better to make the pageblock size adjustable based on the
> total memory of a system. It is not reasonable to have the same
> pageblock size across systems with memory sizes from <1GB to several
> TBs. Do you agree?

I suggest an additional mechanism on top. Please bear in mind that
ordinary THP will most probably still be the default for 99.9% of all
application/library cases, even when you have gigantic THP around.

>> 2. Section size
>>
>> I assume the only reason you want to touch that is because
>> pageblock_size <= section_size, and I guess that's one of the reasons
>> I dislike it so much. Messing with the section size really only makes
>> sense when we want to manage metadata for larger granularity within a
>> section.
> 
> Perhaps it is worth checking if it is feasible to make pageblock_size
> > section_size, so we can still have small sections when
> pageblock_size is large.
> One potential issue with that is that when PFNs are discontinuous at a
> section boundary, we might end up with a partial pageblock when
> pageblock_size is big. I guess supporting partial pageblocks (or
> different pageblock sizes, like you mentioned below) would be the
> right solution.
> 
>> We allocate metadata per section. We mark whole sections
>> early/online/present/.... Yes, in case of vmemmap, we manage the
>> memmap in smaller granularity using the sub-section map, some kind of
>> hack to support some ZONE_DEVICE cases better.
>>
>> Let's assume we introduce something new like "gigapage_order",
>> corresponding to 1G. We could either decide to squeeze the metadata
>> into sections, having to increase the section size, or manage that
>> metadata differently.
>>
>> Managing it differently certainly makes the necessary changes easier.
>> Instead of adding more hacks into sections, rather manage that
>> metadata at a different place / in a different way.
> 
> Can you elaborate on managing it differently?

Let's keep it simple. Assume you track MOVABLE vs. !movable per 1G
gigapageblock, in addition to existing pageblocks. A 64 TB system would
have 64*1024 gigapageblocks. One bit per gigapageblock would require 8k,
a.k.a. 2 pages. If you need more states, it would maybe double. No need
to manage that using sparse memory sections IMHO. Just allocate 2/4
pages during boot for the bitmap.

> 
>> See [1] for an alternative. Not necessarily what I would dream of,
>> but just to showcase that there might be alternatives for grouping
>> pages.
> 
> I saw this patch too. It is an interesting idea to separate different
> allocation orders into different regions, but it would not work for
> gigantic page allocations unless we have a large pageblock size to
> utilize existing anti-fragmentation mechanisms.

Right, any anti-fragmentation mechanism on top.

>> 3. Grouping pages > pageblock_order
>>
>> There are other approaches that would benefit from grouping
>> at > pageblock_order and having a bigger MAX_ORDER. And that doesn't
>> necessarily mean forming gigantic pages only; we might want to group
>> in multiple granularities on a single system. Memory hot(un)plug is
>> one example, but also optimizing memory consumption by powering down
>> DIMM banks. Also, some architectures support differing huge page
>> sizes (aarch64) that could be improved without CMA. Why not have more
>> than 2 THP sizes on these systems?
>>
>> Ideally, we'd have a mechanism that tries grouping on different
>> granularities, like for every order in pageblock_order ...
>> max_pageblock_order (e.g., 1 GiB), and not only add one new level of
>> grouping (or increase the single grouping size).
> 
> I agree. In some sense, supporting partial pageblocks and increasing
> the pageblock size (e.g., to 1GB) is, at a high level, quite similar
> to having multiple pageblock sizes. But I am not sure we really want
> to support multiple pageblock sizes, since it creates pageblock
> fragmentation when we want to change the migratetype for part of a
> pageblock. This means we would break a large pageblock into small ones
> if we just want to steal a subset of pages from MOVABLE for UNMOVABLE
> allocations. Then pageblock loses its most useful anti-fragmentation
> feature. Also, it seems to be a replication of buddy allocator
> functionality when it comes to pageblock split and merge.

Let's assume for simplicity that you have a 4G machine, so at most 4
gigantic pages. The first gigantic page will be impossible either way
due to the kernel, boot time allocations, etc. So you're left with 3
gigantic pages you could use at best.

Obviously, you want to make sure that the remaining parts of the first
gigantic page are used as well as possible for ordinary huge pages, so
you would actually want to group them in 2 MiB chunks and avoid
fragmentation there.
Obviously, supporting two pageblock types natively would require core
modifications. (I'm not pushing for the idea of two pageblock orders,
just motivating why we actually want to keep grouping for ordinary
THP.)

> 
> 
> The above is really a nice discussion with you on pageblocks,
> sections, and memory hotplug/hotremove, which also helps me better
> understand the issues with increasing MAX_ORDER to enable 1GB page
> allocation.
> 
> In sum, if I get it correctly, the issues I need to address are:
> 
> 1. A large pageblock size (which is needed when we bump MAX_ORDER for
> gigantic page allocation from the buddy allocator) is not good for
> machines with small memory.
> 
> 2. The pageblock size is currently tied to the section size (which
> made me want to bump the section size).
> 
> For 1, I think making MAX_ORDER a variable that can be set based on
> total memory size and adjusted via a boot time parameter should solve
> the problem. For small machines, we will keep MAX_ORDER as small as it
> is now, like 4MB, whereas for large machines, we can increase
> MAX_ORDER to utilize gigantic pages.
> 
> For 2, supporting partial pageblocks and allowing a pageblock to cross
> multiple sections would break the tie between pageblock size and
> section size to solve the issue.
> 
> I am going to look into them. What do you think?

I am not sure that's really the right direction, as stated above.

-- 
Thanks,

David / dhildenb