From: Yu Zhao
Date: Tue, 4 Jul 2023 18:21:26 -0600
Subject: Re: [PATCH v2 0/5] variable-order, large folios for anonymous memory
To: Yin Fengwei, Ryan Roberts
Cc: Andrew Morton, Matthew Wilcox, "Kirill A. Shutemov", David Hildenbrand,
 Catalin Marinas, Will Deacon, Anshuman Khandual, Yang Shi,
 linux-arm-kernel@lists.infradead.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org
In-Reply-To: <449183bd-76ef-2a3a-c3f5-0478a7c574ef@intel.com>
References: <20230703135330.1865927-1-ryan.roberts@arm.com>
 <69aada71-0b3f-e928-6413-742fe7926576@intel.com>
 <467afd30-c85a-8b9d-97b9-a9ef9d0983af@arm.com>
 <449183bd-76ef-2a3a-c3f5-0478a7c574ef@intel.com>

On Tue, Jul 4, 2023 at 5:53 PM Yin Fengwei wrote:
>
>
>
> On 7/4/23 23:36, Ryan Roberts wrote:
> > On 04/07/2023 08:11, Yu Zhao wrote:
> >> On Tue, Jul 4, 2023 at 12:22 AM Yin, Fengwei wrote:
> >>>
> >>> On 7/4/2023 10:18 AM, Yu Zhao wrote:
> >>>> On Mon, Jul 3, 2023 at 7:53 AM Ryan Roberts wrote:
> >>>>>
> >>>>> Hi All,
> >>>>>
> >>>>> This is v2 of a series to implement variable order, large folios for anonymous
> >>>>> memory. The objective of this is to improve performance by allocating larger
> >>>>> chunks of memory during anonymous page faults. See [1] for background.
> >>>>
> >>>> Thanks for the quick response!
> >>>>
> >>>>> I've significantly reworked and simplified the patch set based on comments from
> >>>>> Yu Zhao (thanks for all your feedback!). I've also renamed the feature to
> >>>>> VARIABLE_THP, on Yu's advice.
> >>>>>
> >>>>> The last patch is for arm64 to explicitly override the default
> >>>>> arch_wants_pte_order() and is intended as an example. If this series is accepted
> >>>>> I suggest taking the first 4 patches through the mm tree and the arm64 change
> >>>>> could be handled through the arm64 tree separately. Neither has any build
> >>>>> dependency on the other.
> >>>>>
> >>>>> The one area where I haven't followed Yu's advice is in the determination of the
> >>>>> size of folio to use. It was suggested that I have a single preferred large
> >>>>> order, and if it doesn't fit in the VMA (due to exceeding VMA bounds, or there
> >>>>> being existing overlapping populated PTEs, etc.) then fall back immediately to
> >>>>> order-0. It turned out that this approach caused a performance regression in the
> >>>>> Speedometer benchmark.
> >>>>
> >>>> I suppose it's a regression against the v1, not the unpatched kernel.
> >>> From the performance data Ryan shared, it's against the unpatched kernel:
> >>>
> >>> Speedometer 2.0:
> >>>
> >>> | kernel                         | runs_per_min |
> >>> |:-------------------------------|-------------:|
> >>> | baseline-4k                    |         0.0% |
> >>> | anonfolio-lkml-v1              |         0.7% |
> >>> | anonfolio-lkml-v2-simple-order |        -0.9% |
> >>> | anonfolio-lkml-v2              |         0.5% |
> >>
> >> I see. Thanks.
> >>
> >> A couple of questions:
> >> 1. Do we have a stddev?
> >
> > | kernel                    | mean_abs | std_abs | mean_rel | std_rel |
> > |:--------------------------|---------:|--------:|---------:|--------:|
> > | baseline-4k               |    117.4 |     0.8 |     0.0% |    0.7% |
> > | anonfolio-v1              |    118.2 |     1.0 |     0.7% |    0.9% |
> > | anonfolio-v2-simple-order |    116.4 |     1.1 |    -0.9% |    0.9% |
> > | anonfolio-v2              |    118.0 |     1.2 |     0.5% |    1.0% |
> >
> > This is with 3 runs per reboot across 5 reboots, with the first run after reboot
> > trimmed (it's always a bit slower, I assume due to cold page cache). So 10 data
> > points per kernel in total.
> >
> > I've rerun the test multiple times and see similar results each time.
> >
> > I've also run anonfolio-v2 with Kconfig FLEXIBLE_THP=disabled, and in this case I
> > see the same performance as baseline-4k.
> >
> >
> >> 2. Do we have a theory why it regressed?
> >
> > I have a woolly hypothesis; I think Chromium is doing mmap/munmap in ways that
> > mean when we fault, order-4 is often too big to fit in the VMA. So we fall back
> > to order-0. I guess this is happening so often for this workload that the cost
> > of doing the checks and fallback is outweighing the benefit of the memory that
> > does end up with order-4 folios.
> >
> > I've sampled the memory in each bucket (once per second) while running, and it's
> > roughly:
> >
> > 64K: 25%
> > 32K: 15%
> > 16K: 15%
> > 4K:  45%
> >
> > 32K and 16K obviously fold into the 4K bucket with anonfolio-v2-simple-order.
> > But potentially, I suspect there is lots of mmap/munmap for the smaller sizes and
> > the 64K contents are more static - that's just a guess though.
> So this is like the out-of-VMA-range thing.
> >
> >
> >> Assuming no bugs, I don't see how a real regression could happen --
> >> falling back to order-0 isn't different from the original behavior.
> >> Ryan, could you `perf record` and `cat /proc/vmstat` and share them?
> >
> > I can, but it will have to be a bit later in the week. I'll do some more test
> > runs overnight so we have a larger number of runs - hopefully that might tell us
> > that this is noise to a certain extent.
> >
> > I'd still like to hear a clear technical argument for why the bin-packing
> > approach is not the correct one!
> My understanding of Yu's comments (Yu, correct me if I am wrong) is that we
> postpone this part of the change and get basic anon large folio support in. Then
> we discuss which approach we should take. Maybe people will agree the retry is
> the right choice, maybe another approach will be taken...
>
> For example, for this out-of-VMA-range case, a per-VMA order should be considered.
> We don't need to decide now that the retry approach should be taken.

I've articulated the reasons in another email. Just to summarize the most
important point here: using more fallback orders makes a system reach
equilibrium faster, at which point it can't allocate folios of the
arch_wants_pte_order() order anymore. IOW, this best-fit policy can reduce the
number of folios of the h/w preferred order on a system that has run long
enough.
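[Editor's note: to make the two policies being compared above concrete, here is
a hypothetical userspace sketch. The function names (`order_fits`,
`pick_order_simple`, `pick_order_bestfit`) are invented for illustration, the
order-4 preference follows the arm64 example in the thread, and this is a
simplified model, not kernel code.]

```python
PAGE_SHIFT = 12
PREFERRED_ORDER = 4  # e.g. 64K folios with 4K base pages, per the arm64 example


def order_fits(addr, order, vma_start, vma_end, populated=frozenset()):
    """True if a naturally aligned order-`order` folio covering `addr` lies
    inside [vma_start, vma_end) and overlaps no already-populated PTE."""
    size = 1 << (order + PAGE_SHIFT)
    start = addr & ~(size - 1)  # natural alignment of the folio
    if start < vma_start or start + size > vma_end:
        return False
    return all(a not in populated
               for a in range(start, start + size, 1 << PAGE_SHIFT))


def pick_order_simple(addr, vma_start, vma_end, populated=frozenset()):
    """'simple-order' policy: the preferred order, or order-0 immediately."""
    if order_fits(addr, PREFERRED_ORDER, vma_start, vma_end, populated):
        return PREFERRED_ORDER
    return 0


def pick_order_bestfit(addr, vma_start, vma_end, populated=frozenset()):
    """'best-fit' (bin-packing) policy: step down through smaller orders."""
    for order in range(PREFERRED_ORDER, 0, -1):
        if order_fits(addr, order, vma_start, vma_end, populated):
            return order
    return 0


# A 32K VMA: too small for a 64K folio, large enough for a 32K one.
vma_start, vma_end, fault = 0x10000, 0x18000, 0x12000
print(pick_order_simple(fault, vma_start, vma_end))   # 0: straight to order-0
print(pick_order_bestfit(fault, vma_start, vma_end))  # 3: settles on 32K
```

In a small VMA the simple-order policy drops straight to order-0, while
best-fit settles on an intermediate order. That difference is what the
equilibrium argument above is about: best-fit keeps succeeding at intermediate
orders, which changes how quickly free memory of the h/w preferred order is
consumed.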