From: David Hildenbrand
Organization: Red Hat
To: Michel Lespinasse, Linux-MM, Linux-Kernel
Cc: Laurent Dufour, Peter Zijlstra, Michal Hocko, Matthew Wilcox,
    Rik van Riel, Paul McKenney, Andrew Morton, Suren Baghdasaryan,
    Joel Fernandes, Andy Lutomirski
Subject: Re: [PATCH 00/29] Speculative page faults (anon vmas only)
Date: Thu, 17 Jun 2021 15:46:57 +0200
Message-ID: <3047d699-2793-e051-e1eb-deef7c5764a8@redhat.com>
In-Reply-To: <20210430195232.30491-1-michel@lespinasse.org>

On 30.04.21 21:52, Michel Lespinasse wrote:
> This patchset is my take on speculative page faults (spf).
> It builds on ideas that have been previously proposed by Laurent Dufour,
> Peter Zijlstra and others before. While Laurent's previous proposal
> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
> this now based on what I think is a simpler and more bisectable approach,
> much improved scaling numbers in the anonymous vma case, and the Android
> use case that has since emerged. I will expand on these points towards
> the end of this message.
>
> The patch series applies on top of linux v5.12;
> a git tree is also available:
> git fetch https://github.com/lespinasse/linux.git v5.12-spf-anon
>
> I believe these patches should be considered for merging.
> My github also has a v5.12-spf branch which extends this mechanism
> for handling file mapped vmas too; however I believe these are less
> mature and I am not submitting them for inclusion at this point.
>
>
> Compared to the previous (RFC) proposal, I have split out / left out
> the file VMA handling parts, fixed some config specific build issues,
> added a few more comments and modified the speculative fault handling
> to use rcu_read_lock() rather than local_irq_disable() in the
> MMU_GATHER_RCU_TABLE_FREE case.
>
>
> Classical page fault processing takes the mmap read lock in order to
> prevent races with mmap writers. In contrast, speculative fault
> processing does not take the mmap read lock, and instead verifies,
> when the results of the page fault are about to get committed and
> become visible to other threads, that no mmap writers have been
> running concurrently with the page fault. If the check fails,
> speculative updates do not get committed and the fault is retried
> in the usual, non-speculative way (with the mmap read lock held).
>
> The concurrency check is implemented using a per-mm mmap sequence count.
> The counter is incremented at the beginning and end of each mmap write
> operation.
> If the counter is initially observed to have an even value,
> and has the same value later on, the observer can deduce that no mmap
> writers have been running concurrently with it between those two times.
> This is similar to a seqlock, except that readers never spin on the
> counter value (they would instead revert to taking the mmap read lock),
> and writers are allowed to sleep. One benefit of this approach is that
> it requires no writer side changes, just some hooks in the mmap write
> lock APIs that writers already use.
>
> The first step of a speculative page fault is to look up the vma and
> read its contents (currently by making a copy of the vma, though in
> principle it would be sufficient to only read the vma attributes that
> are used in page faults). The mmap sequence count is used to verify
> that there were no mmap writers concurrent to the lookup and copy steps.
> Note that walking rbtrees while there may potentially be concurrent
> writers is not an entirely new idea in linux, as latched rbtrees
> are already doing this. This is safe as long as the lookup is
> followed by a sequence check to verify that concurrency did not
> actually occur (and abort the speculative fault if it did).
>
> The next step is to walk down the existing page table tree to find the
> current pte entry. This is done with interrupts disabled to avoid
> races with munmap(). Again, not an entirely new idea, as this repeats
> a pattern already present in fast GUP. Similar precautions are also
> taken when taking the page table lock.

Hi Michel,

I just started working on a project to reclaim page tables that are no
longer needed inside running processes (for example, page tables that
are empty after madvise(DISCARD)). Long story short, there are
scenarios where we want to scan for such page tables asynchronously to
free up memory (which can be quite significant in some use cases).

Now that I (mostly) understand the complex locking, I'm looking for
other mm features that might be "problematic" in that regard and
require proper planning to get right (or that have to be made mutually
exclusive).

As I essentially rip out page tables from the page table hierarchy to
free them (in the simplest case within a VMA to get started), I
certainly need the mmap lock in read mode right now to scan the page
table hierarchy, and the mmap lock in write mode when actually removing
a page table. This is similar to what khugepaged does when collapsing a
THP and removing a page table. Of course, we could use any kind of
synchronization mechanism (-> rcu) to make sure nobody is using a page
table anymore before actually freeing it.

1. I now wonder how your code actually protects against e.g.,
khugepaged and how it could protect against page table reclaim. Will we
be using RCU while walking the page tables? That would make life easier.

2. You mention "interrupts disabled to avoid races with munmap()". Can
you elaborate on how that is supposed to work? Shouldn't we rather be
using RCU than manually disabling interrupts? What is the rationale?

Thanks a lot in advance!

-- 
Thanks,

David / dhildenb
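
For reference, the sequence-count check described in the quoted cover
letter can be modelled in user space roughly as below. This is only a
sketch under stated assumptions, not code from the patch series: the
names mm_model, model_write_lock(), spec_begin() and spec_validate() are
hypothetical, C11 atomics stand in for the kernel's primitives, and the
page table walk and its locking are not modelled at all.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for per-mm state; not the kernel's mm_struct. */
struct mm_model {
    atomic_ulong mmap_seq; /* even: no writer active, odd: writer active */
    atomic_int vma_prot;   /* example data normally guarded by the mmap lock */
};

static void model_init(struct mm_model *mm, int prot)
{
    atomic_init(&mm->mmap_seq, 0);
    atomic_init(&mm->vma_prot, prot);
}

/* Hook run when a writer takes the write lock: count goes even -> odd. */
static void model_write_lock(struct mm_model *mm)
{
    unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);
    assert((old & 1) == 0);
    (void)old;
}

/* Hook run when a writer drops the write lock: count goes odd -> even. */
static void model_write_unlock(struct mm_model *mm)
{
    unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);
    assert((old & 1) == 1);
    (void)old;
}

/* Sample the count; an odd value means a writer is active, so give up. */
static bool spec_begin(struct mm_model *mm, unsigned long *seq)
{
    *seq = atomic_load(&mm->mmap_seq);
    return (*seq & 1) == 0;
}

/* True only if no writer started or finished since spec_begin(). */
static bool spec_validate(struct mm_model *mm, unsigned long seq)
{
    return atomic_load(&mm->mmap_seq) == seq;
}

/*
 * Speculative read: never spins. On failure the caller is expected to
 * fall back to the ordinary path that takes the read lock.
 */
static bool spec_read_prot(struct mm_model *mm, int *prot)
{
    unsigned long seq;
    int snapshot;

    if (!spec_begin(mm, &seq))
        return false;
    snapshot = atomic_load(&mm->vma_prot);
    if (!spec_validate(mm, seq))
        return false;
    *prot = snapshot;
    return true;
}
```

A reader that gets false from spec_read_prot() simply retries under the
read lock, which mirrors the "readers never spin" property in the quoted
text. Whether a similar check would be enough to protect a page table
walk against page table reclaim is exactly the open question above.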