From: David Hildenbrand
Organization: Red Hat
To: Gavin Shan, linux-mm@kvack.org
Cc: linux-kernel@vger.kernel.org, akpm@linux-foundation.org,
 shan.gavin@gmail.com, Anshuman Khandual, Alexander Duyck
Subject: Re: [RFC PATCH] mm/page_reporting: Adjust threshold according to MAX_ORDER
Date: Wed, 16 Jun 2021 13:15:01 +0200
References: <20210601033319.100737-1-gshan@redhat.com>
 <76516781-6a70-f2b0-f3e3-da999c84350f@redhat.com>
 <0c0eb8c8-463d-d6f1-3cec-bbc0af0a229c@redhat.com>
 <74b0d35f-707d-aa11-19e7-fedb74d77159@redhat.com>
 <6ebc99f9-649d-fbd2-aadf-87291e41b36d@redhat.com>

On 16.06.21 14:59, Gavin Shan wrote:
> On 6/16/21 5:59 PM, David Hildenbrand wrote:
>> On 16.06.21 03:53, Gavin Shan wrote:
>>> On 6/14/21 9:03 PM, David Hildenbrand wrote:
>>>> On 11.06.21 09:44, Gavin Shan wrote:
>>>>> On 6/1/21 6:01 PM, David Hildenbrand wrote:
>>>>>> On 01.06.21 05:33, Gavin Shan wrote:
>>>>>>> The PAGE_REPORTING_MIN_ORDER is equal to @pageblock_order, taken as
>>>>>>> minimal order (threshold) to trigger page reporting. The page reporting
>>>>>>> is never triggered with the following configurations and settings on
>>>>>>> aarch64. In the particular scenario, the page reporting won't be triggered
>>>>>>> until the largest (2 ^ (MAX_ORDER-1)) free area is achieved from the
>>>>>>> page freeing. The condition is very hard, or even impossible to be met.
>>>>>>>
>>>>>>>      CONFIG_ARM64_PAGE_SHIFT:            16
>>>>>>>      CONFIG_HUGETLB_PAGE:                Y
>>>>>>>      CONFIG_HUGETLB_PAGE_SIZE_VARIABLE:  N
>>>>>>>      pageblock_order:                    13
>>>>>>>      CONFIG_FORCE_MAX_ZONEORDER:         14
>>>>>>>      MAX_ORDER:                          14
>>>>>>>
>>>>>>> The issue can be reproduced in VM, running kernel with above configurations
>>>>>>> and settings. The 'memhog' is used inside the VM to access 512MB anonymous
>>>>>>> area. The QEMU's RSS doesn't drop accordingly after 'memhog' exits.
>>>>>>>
>>>>>>>      /home/gavin/sandbox/qemu.main/build/qemu-system-aarch64          \
>>>>>>>      -accel kvm -machine virt,gic-version=host                        \
>>>>>>>      -cpu host -smp 8,sockets=2,cores=4,threads=1 -m 4096M,maxmem=64G \
>>>>>>>      -object memory-backend-ram,id=mem0,size=2048M                    \
>>>>>>>      -object memory-backend-ram,id=mem1,size=2048M                    \
>>>>>>>      -numa node,nodeid=0,cpus=0-3,memdev=mem0                         \
>>>>>>>      -numa node,nodeid=1,cpus=4-7,memdev=mem1                         \
>>>>>>>        :                                                              \
>>>>>>>      -device virtio-balloon-pci,id=balloon0,free-page-reporting=yes
>>>>>>>
>>>>>>> This tries to fix the issue by adjusting the threshold to the smaller value
>>>>>>> of @pageblock_order and (MAX_ORDER/2). With this applied, the QEMU's RSS
>>>>>>> drops after 'memhog' exits.
>>>>>>
>>>>>> IIRC, we use pageblock_order to
>>>>>>
>>>>>> a) Reduce the free page reporting overhead. Reporting on small chunks can make us report constantly with little system activity.
>>>>>>
>>>>>> b) Avoid splitting THP in the hypervisor, avoiding downgraded VM performance.
>>>>>>
>>>>>> c) Avoid affecting creation of pageblock_order pages while hinting is active. I think there are cases where "temporary pulling sub-pageblock pages" can negatively affect creation of pageblock_order pages. Concurrent compaction would be one of these cases.
>>>>>>
>>>>>> The monstrosity called aarch64 64k is really special in that sense, because a) does not apply because pageblocks are just very big, b) does sometimes not apply because either our VM isn't backed by (rare) 512MB THP or uses 4k with 2MB THP and c) similarly doesn't apply in smallish VMs because we don't really happen to create 512MB THP either way.
>>>>>>
>>>>>>
>>>>>> For example, going on x86-64 from reporting 2MB to something like 32KB is absolutely undesired.
>>>>>>
>>>>>> I think if we want to go down that path (and I am not 100% sure yet if we want to), we really want to treat only the special case in a special way. Note that even when doing it only for aarch64 with 64k, you will still end up splitting THP in a hypervisor if it uses 64k base pages (b)) and can affect creation of THP, for example, when compacting (c), so there is a negative side to that.
>>>>>>
>>>>>
>>>>> [Remove Alexander from the cc list as his mail isn't reachable]
>>>>>
>>>>
>>>> [adding his gmail address which should be the right one]
>>>>
>>>>> David, thanks for your time to review and sorry for the delay and late response.
>>>>> I spent some time to get myself familiar with the code, but there are still some
>>>>> questions to me, explained as below.
>>>>>
>>>>> Yes, @pageblock_order is currently taken as page reporting threshold. It will
>>>>> incur more overhead if the threshold is decreased as you said in (a).
>>>>
>>>> Right. Alex did quite some performance/overhead evaluation when introducing this feature. Changing the reporting granularity on most setups (esp., x86-64) is not desired IMHO.
>>>>
>>>
>>> Thanks for adding Alex's correct mail address, David.
>>>
>>>>>
>>>>> This patch tries to decrease the free page reporting threshold. The @pageblock_order
>>>>> isn't touched. I don't understand how the code changes affect THP splitting
>>>>> and the creation of page blocks mentioned in (b) and (c). David, could you please
>>>>> provide more details?
>>>>
>>>> Think of it like this: while reporting to the hypervisor, we temporarily turn free/"movable" pieces of a pageblock "unmovable" -- see __isolate_free_page()->del_page_from_free_list(). While reporting them to the hypervisor, these pages are not available and not even marked as PageBuddy() anymore.
>>>>
>>>> There are at least two scenarios where this could affect creation of free pageblocks I can see:
>>>>
>>>> a. Compaction. While compacting, we might identify completely movable/free pageblocks, however, actual compaction on that pageblock can fail because some part is temporarily unmovable.
>>>>
>>>> b. Free/alloc sequences. Assume a pageblock is mostly free, except two pages (x and y). Assume the following sequence:
>>>>
>>>> 1. free(x)
>>>> 2. free(y)
>>>> 3. alloc
>>>>
>>>> Before your change, after 1. and 2. we'll have a free pageblock. 3. won't allocate from that pageblock.
>>>>
>>>> With your change, free page reporting might run after 1. After 2., we'll not have a free pageblock (until free page reporting finished), and 3. might just reallocate what we freed in 2. and prevent having a free pageblock.
>>>>
>>>>
>>>> No idea how relevant both points are in practice, however, the fundamental difference to current handling is that we would turn parts of pageblocks temporarily unmovable, instead of complete pageblocks.
>>>>
>>>
>>> Thank you for the details. Without my changes, when the page reporting threshold
>>> is @pageblock_order, the whole page block can become 'movable' from 'unmovable'.
>>> I don't think it's what we want, but I need Alex's confirmation.
>>
>> __isolate_free_page() will set the pageblock MIGRATE_MOVABLE in that case. It's only temporarily unmovable, while we're hinting.
>>
>> Note that MOVABLE vs. UNMOVABLE is just grouping for free pages, and even setting it to the wrong migratetype isn't "wrong" as in "correctness". It doesn't make a difference if there are no free pages because the whole block is isolated.
>>
>
> Yes, it doesn't matter since these pages have been isolated. The migration type is changed to MIGRATE_MOVABLE
> in __isolate_free_page(). My questions are actually:
>
> (1) Is it possible the migration type is changed from MIGRATE_UNMOVABLE to MIGRATE_MOVABLE
> in __isolate_free_page()?

Yes, if the isolated page covers at least half the pageblock. So either
if we isolate the complete pageblock (as it's free, there is nothing
unmovable) or half the pageblock. The latter seems to be some heuristic
that says if it's half-free, make it MIGRATE_MOVABLE -- maybe because
that increases the chances that we might get a completely movable
pageblock later (would have to look into the details).

> (2) After the free page reporting is completed, the migrate type is restored to MIGRATE_UNMOVABLE?

No, don't think so. And it also doesn't make too much sense if we
decided when isolating that we're better off using MIGRATE_MOVABLE.
After all, we're just putting back a free page we previously isolated
from the free lists.

-- 
Thanks,

David / dhildenb
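
For reference, the arithmetic behind the numbers in this thread can be
sketched in plain userspace C. This is only an illustration, not kernel
code: the helper names (proposed_min_order, order_to_bytes,
covers_half_pageblock) are made up for this sketch, and the last one is
merely a model of the "covers at least half the pageblock" condition
described above, assuming that half a pageblock corresponds to order
pageblock_order - 1.

```c
#include <assert.h>
#include <stdbool.h>

/* Threshold proposed by the RFC: the smaller of pageblock_order and MAX_ORDER / 2. */
static inline unsigned int proposed_min_order(unsigned int pageblock_order,
                                              unsigned int max_order)
{
	unsigned int half = max_order / 2;

	return pageblock_order < half ? pageblock_order : half;
}

/* Size in bytes of a free area of the given order, for a given page shift. */
static inline unsigned long long order_to_bytes(unsigned int order,
                                                unsigned int page_shift)
{
	/* Guard against shifting past the width of the 64-bit result. */
	assert(order + page_shift < 64);
	return 1ULL << (order + page_shift);
}

/*
 * Hypothetical model of "the isolated page covers at least half the
 * pageblock": true for orders >= pageblock_order - 1.
 */
static inline bool covers_half_pageblock(unsigned int order,
                                         unsigned int pageblock_order)
{
	return order + 1 >= pageblock_order;
}
```

With the configuration quoted at the top of the thread (page shift 16,
pageblock_order 13, MAX_ORDER 14), order_to_bytes(13, 16) is 512MB, which
matches the 512MB THP size mentioned earlier, while
proposed_min_order(13, 14) is 7, i.e. 8MB chunks. Under the model above,
an order-7 isolation would be well below half a pageblock.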