From: Linus Torvalds
Date: Mon, 27 Jul 2020 17:38:01 -0700
Subject: Re: [patch 01/15] mm/memory.c: avoid access flag update TLB flush for retried page fault
To: Yang Shi
Cc: linux-arch, Yu Xu, Catalin Marinas, Andrew Morton, Johannes Weiner,
 Hillf Danton, Hugh Dickins, Josef Bacik, "Kirill A. Shutemov", Linux-MM,
 mm-commits@vger.kernel.org, Will Deacon, Matthew Wilcox
In-Reply-To: <89c6671a-39ba-d1cc-9bac-2e6717916220@linux.alibaba.com>

On Mon, Jul 27, 2020 at 3:43 PM Yang Shi wrote:
>
> With the commit ("mm: drop mmap_sem before calling balance_dirty_pages()
> in write fault") the retried fault may happen much more frequently than
> before since it would drop mmap lock as long as dirty throttling happens.

Sure. And that probably explains why it shows up as a regression.

That said, the fact that it showed up as a regression at all probably
means that the whole spurious TLB flush has always been a low-grade
problem, and the extra retries just made it much more noticeable
because now there was a change to compare against.

The fact that we have that (very questionable) optimization to only do
it for writes kind of reinforces that notion - it has happened before,
it's just never been fixed properly, and it's just never been noticeable
on most machines because this is all a no-op on x86.

I think Catalin's patch - with some way to fix the problem with KVM -
is the way to go.

That said, testing FAULT_FLAG_TRIED and suppressing the spurious TLB
flush for that case is certainly always safe (see the sketch of that
check further down). At worst, we'll take another fault, and then do
the TLB flush at _that_ point when not retrying.

So it's the FAULT_FLAG_WRITE test that I think is bogus, or at least
should be protected by some architecture decision (with a comment about
why it's ok for that architecture, ie the ARM kind of "old PTE's will
never be in the TLB, and if it's not a write fault we know it doesn't
depend on the dirty bit either").

Of course, it may be that on every architecture that requires SW
accessed bits, the "old PTE's will never be in the TLB" rule is true.
Except I think I know at least one architecture where that isn't true.
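For reference, a simplified sketch (from memory; not the exact kernel
source and not the patch text) of the access-flag update path in
handle_pte_fault() in mm/memory.c, with the FAULT_FLAG_TRIED check
discussed above marked:

	entry = pte_mkyoung(entry);
	if (ptep_set_access_flags(vmf->vma, vmf->address, vmf->pte, entry,
				  vmf->flags & FAULT_FLAG_WRITE)) {
		update_mmu_cache(vmf->vma, vmf->address, vmf->pte);
	} else {
		/*
		 * The PTE didn't change, so the fault was likely spurious
		 * (another CPU already updated it).  Today the possibly
		 * stale TLB entry is only flushed for write faults; also
		 * skipping it when FAULT_FLAG_TRIED is set is the
		 * "certainly always safe" variant: at worst we take one
		 * more fault and flush at that point.
		 */
		if ((vmf->flags & FAULT_FLAG_WRITE) &&
		    !(vmf->flags & FAULT_FLAG_TRIED))	/* sketch of the discussed check */
			flush_tlb_fix_spurious_fault(vmf->vma, vmf->address);
	}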
On alpha, the way the accessed bit works is exactly the same way the
dirty bit works - except it's done for reads instead of writes. So on
at least one architecture, access faults and dirty faults are 100%
equivalent, just using the read/write bits respectively.

Of course, alpha doesn't really matter any more. But it's an example of
an architecture where "old" does not necessarily mean "cannot be in the
TLB", and where testing for FAULT_FLAG_WRITE looks buggy.

Again: I think in practice it's really *really* hard to hit the problem
with accessed bits, unlike dirty bits. Normally, PTE's are all
instantiated young if they are in the TLB. You have to kind of work at
it to get an old PTE _and_ then hit the "now access it at exactly the
same time from two different CPU's, and watch one CPU keep taking page
faults forever because it never flushes its TLB entry" case.

Of course, it is so long since I worked with alpha that maybe there's
some other reason this can't happen. Like "PAL-code always flushes the
TLB entry of the faulting address". Which all hardware should do,
dammit. It's all kinds of stupid to cache a faulting TLB entry. The
fault is thousands of times more expensive than a reload would be, even
if it were intentional and done repeatedly (which sounds like an insane
thing to optimize for anyway).

So one way to fix this problem would be to just specify that "every
pagefault handler _must_ flush the local-CPU TLB entry that the fault
happened for, if the architecture doesn't already do that in hardware
or microcode". And then we'd just remove the spurious TLB flush code
entirely.

              Linus
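As an illustration of that last suggestion, an architecture's fault path
would unconditionally drop the local CPU's TLB entry for the faulting
address before returning, so the generic spurious-fault flush could go
away entirely. A hypothetical sketch only - the function and helper
names below are assumed for illustration; local_flush_tlb_page() exists
on some architectures but is not a generic API:

	/*
	 * Hypothetical sketch of the proposed rule: the arch fault path
	 * always flushes this CPU's TLB entry for the faulting address
	 * (unless hardware/PAL-code already guarantees that), so generic
	 * code never needs flush_tlb_fix_spurious_fault() at all.
	 */
	static vm_fault_t arch_do_page_fault(struct vm_area_struct *vma,
					     unsigned long address,
					     unsigned int flags)
	{
		vm_fault_t ret = handle_mm_fault(vma, address, flags);

		/* Never let a stale (old/clean) entry keep re-faulting on this CPU. */
		local_flush_tlb_page(vma, address);	/* assumed arch-local helper */

		return ret;
	}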