Subject: Re: [PATCH 00/29] Speculative page faults (anon vmas only)
From: David Hildenbrand
To: Michel Lespinasse, Linux-MM, Linux-Kernel
Cc: Laurent Dufour, Peter Zijlstra, Michal Hocko, Matthew Wilcox, Rik van Riel, Paul McKenney, Andrew Morton, Suren Baghdasaryan, Joel Fernandes, Andy Lutomirski
References: <20210430195232.30491-1-michel@lespinasse.org> <3047d699-2793-e051-e1eb-deef7c5764a8@redhat.com>
In-Reply-To: <3047d699-2793-e051-e1eb-deef7c5764a8@redhat.com>
Organization: Red Hat
Date: Fri, 9 Jul 2021 12:41:03 +0200

On 17.06.21 15:46, David Hildenbrand wrote:
> On 30.04.21 21:52, Michel Lespinasse wrote:
>> This patchset is
>> my take on speculative page faults (spf).
>> It builds on ideas that have been previously proposed by Laurent Dufour,
>> Peter Zijlstra and others before. While Laurent's previous proposal
>> was rejected around the time of LSF/MM 2019, I am hoping we can revisit
>> this now based on what I think is a simpler and more bisectable approach,
>> much improved scaling numbers in the anonymous vma case, and the Android
>> use case that has since emerged. I will expand on these points towards
>> the end of this message.
>>
>> The patch series applies on top of linux v5.12;
>> a git tree is also available:
>> git fetch https://github.com/lespinasse/linux.git v5.12-spf-anon
>>
>> I believe these patches should be considered for merging.
>> My github also has a v5.12-spf branch which extends this mechanism
>> for handling file mapped vmas too; however I believe these are less
>> mature and I am not submitting them for inclusion at this point.
>>
>>
>> Compared to the previous (RFC) proposal, I have split out / left out
>> the file VMA handling parts, fixed some config specific build issues,
>> added a few more comments and modified the speculative fault handling
>> to use rcu_read_lock() rather than local_irq_disable() in the
>> MMU_GATHER_RCU_TABLE_FREE case.
>>
>>
>> Classical page fault processing takes the mmap read lock in order to
>> prevent races with mmap writers. In contrast, speculative fault
>> processing does not take the mmap read lock, and instead verifies,
>> when the results of the page fault are about to get committed and
>> become visible to other threads, that no mmap writers have been
>> running concurrently with the page fault. If the check fails,
>> speculative updates do not get committed and the fault is retried
>> in the usual, non-speculative way (with the mmap read lock held).
>>
>> The concurrency check is implemented using a per-mm mmap sequence count.
>> The counter is incremented at the beginning and end of each mmap write
>> operation.
>> If the counter is initially observed to have an even value,
>> and has the same value later on, the observer can deduce that no mmap
>> writers have been running concurrently with it between those two times.
>> This is similar to a seqlock, except that readers never spin on the
>> counter value (they would instead revert to taking the mmap read lock),
>> and writers are allowed to sleep. One benefit of this approach is that
>> it requires no writer side changes, just some hooks in the mmap write
>> lock APIs that writers already use.
>>
>> The first step of a speculative page fault is to look up the vma and
>> read its contents (currently by making a copy of the vma, though in
>> principle it would be sufficient to only read the vma attributes that
>> are used in page faults). The mmap sequence count is used to verify
>> that there were no mmap writers concurrent to the lookup and copy steps.
>> Note that walking rbtrees while there may potentially be concurrent
>> writers is not an entirely new idea in linux, as latched rbtrees
>> are already doing this. This is safe as long as the lookup is
>> followed by a sequence check to verify that concurrency did not
>> actually occur (and abort the speculative fault if it did).
>>
>> The next step is to walk down the existing page table tree to find the
>> current pte entry. This is done with interrupts disabled to avoid
>> races with munmap(). Again, not an entirely new idea, as this repeats
>> a pattern already present in fast GUP. Similar precautions are also
>> taken when taking the page table lock.
>
> Hi Michel,
>
> I just started working on a project to reclaim page tables inside
> running processes that are no longer needed (for example, empty after
> madvise(DISCARD)). Long story short, there are scenarios where we want
> to scan for such page tables asynchronously to free up memory (which can
> be quite significant in some use cases).
>
> Now that I (mostly) understood the complex locking, I'm looking for
> other mm features that might be "problematic" in that regard and require
> proper planning to get right (or let them run mutually exclusive).
>
> As I essentially rip out page tables from the page table hierarchy to
> free them (in the simplest case within a VMA to get started), I
> certainly need the mmap lock in read mode right now to scan the page
> table hierarchy, and the mmap lock in write mode when actually removing
> a page table. This is similar to what khugepaged does when collapsing a
> THP and removing a page table. Of course, we could use any kind of
> synchronization mechanism (-> rcu) to make sure nobody is using a page
> table anymore before actually freeing it.
>
> 1. I now wonder how your code actually protects against e.g., khugepaged
> and how it could protect against page table reclaim. Will we be using
> RCU while walking the page tables? That would make life easier.
>
> 2. You mention "interrupts disabled to avoid races with munmap()". Can
> you elaborate how that is supposed to work? Shouldn't we rather be using
> RCU than manually disabling interrupts? What is the rationale?

Answering my own questions: I assume this works just like gup_fast's
lockless_pages_from_mm(), whereby we rely on an IPI when flushing the TLB
before actually freeing the page (-> mmu gather).

-- 
Thanks,

David / dhildenb
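
For readers following the thread, the mmap sequence count check described in the quoted cover letter can be sketched roughly as below. This is a user-space illustration using C11 atomics, not code from the patch series: struct mm_sketch and all mmap_seq_*() names are hypothetical, and memory ordering is simplified to the sequentially consistent defaults.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for the per-mm state; not the kernel's mm_struct. */
struct mm_sketch {
	atomic_ulong mmap_seq;	/* even: no writer active, odd: writer active */
};

static void mm_sketch_init(struct mm_sketch *mm)
{
	atomic_init(&mm->mmap_seq, 0);
}

/*
 * Hook for the mmap write lock path. The writer is assumed to already hold
 * the (sleepable) write lock, so writers never race with each other here.
 */
static void mmap_seq_write_begin(struct mm_sketch *mm)
{
	unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);

	assert((old & 1) == 0);	/* counter was even, is now odd */
}

/* Hook for the mmap write unlock path. */
static void mmap_seq_write_end(struct mm_sketch *mm)
{
	unsigned long old = atomic_fetch_add(&mm->mmap_seq, 1);

	assert((old & 1) == 1);	/* counter was odd, is now even again */
}

/*
 * Start of a speculative section. Unlike a seqlock reader this never spins:
 * if a writer is active it returns false and the caller is expected to fall
 * back to the classical path that takes the mmap read lock.
 */
static bool mmap_seq_read_start(struct mm_sketch *mm, unsigned long *seq)
{
	*seq = atomic_load(&mm->mmap_seq);
	return (*seq & 1) == 0;
}

/*
 * Validation before committing speculative results: true only if no writer
 * began (or completed) an operation since mmap_seq_read_start().
 */
static bool mmap_seq_read_check(struct mm_sketch *mm, unsigned long seq)
{
	return atomic_load(&mm->mmap_seq) == seq;
}
```

A speculative fault in this sketch would call mmap_seq_read_start(), do its vma lookup and copy, and commit only if mmap_seq_read_check() still returns true; on any failure it would retry with the mmap read lock held.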