From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 02B1AC433EF for ; Sat, 18 Dec 2021 09:58:21 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 799466B0072; Sat, 18 Dec 2021 04:58:11 -0500 (EST) Received: by kanga.kvack.org (Postfix, from userid 40) id 722C56B0073; Sat, 18 Dec 2021 04:58:11 -0500 (EST) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 5749C6B0074; Sat, 18 Dec 2021 04:58:11 -0500 (EST) X-Delivered-To: linux-mm@kvack.org Received: from forelay.hostedemail.com (smtprelay0048.hostedemail.com [216.40.44.48]) by kanga.kvack.org (Postfix) with ESMTP id 421706B0072 for ; Sat, 18 Dec 2021 04:58:11 -0500 (EST) Received: from smtpin28.hostedemail.com (10.5.19.251.rfc1918.com [10.5.19.251]) by forelay04.hostedemail.com (Postfix) with ESMTP id E43C28C5A6 for ; Sat, 18 Dec 2021 09:58:00 +0000 (UTC) X-FDA: 78930463920.28.764A59B Received: from us-smtp-delivery-124.mimecast.com (us-smtp-delivery-124.mimecast.com [170.10.133.124]) by imf24.hostedemail.com (Postfix) with ESMTP id 1880118003D for ; Sat, 18 Dec 2021 09:57:56 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=redhat.com; s=mimecast20190719; t=1639821479; h=from:from:reply-to:subject:subject:date:date:message-id:message-id: to:to:cc:cc:mime-version:mime-version:content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references; bh=HhnlOnE5Zw/akPI9MUS3YPheTam56HL5gZvtQRAMy6s=; b=YCmIb0J7lX/9L5FjLwGHTaq/bJmeBUTdH7R1bmqNONF39Hzcigts0hylCGaSKsHIPAl0z3 R+rpgaAUhedmsn26WyPtAV16li/wcZ+TA6A2YakNuJlLEZQLF8Id7JRNuKvDyX0FRjLKf/ JizTAgP249PBg4eN3/2Oc1Tn2QRyTwg= Received: from mail-wm1-f71.google.com (mail-wm1-f71.google.com [209.85.128.71]) by relay.mimecast.com with ESMTP with STARTTLS (version=TLSv1.2, cipher=TLS_ECDHE_RSA_WITH_AES_256_GCM_SHA384) id us-mta-580-6_3YlR30Mn-gj74iurZs4Q-1; Sat, 18 Dec 2021 04:57:57 -0500 X-MC-Unique: 6_3YlR30Mn-gj74iurZs4Q-1 Received: by mail-wm1-f71.google.com with SMTP id n31-20020a05600c3b9f00b0034440f99123so1382256wms.7 for ; Sat, 18 Dec 2021 01:57:57 -0800 (PST) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:message-id:date:mime-version:user-agent :content-language:to:cc:references:from:organization:subject :in-reply-to:content-transfer-encoding; bh=HhnlOnE5Zw/akPI9MUS3YPheTam56HL5gZvtQRAMy6s=; b=atotomrZwd+l9f7Lpuyf5stNOyVleURx+MqXoLFn5O779ppgCDK9JGvC968cF+f+FR +6t0sVPbd2w/cNOM29yXaak07sfI/CDS4lKEsE2kCe60kPdP4Ydixq2z3QtRHzJVDmts Rk8+wa6HoIGyQiYDPTM9zLUcK48IcZAzB83Y4E+LIbr0ht3J+seBzlJoZJPLygWGq26/ kWu3fo/aFjkEVkobc0ky/ywudt3SzCLtPdu/oHy35rBJNai36GJerrN+pU3yxSJbAJov Xd+b38FOMju5IKCdXrSKhBzT74NaSapIOEXGPTqNagowqqEahrdY/mzQZ7F+suFur1Xo Doww== X-Gm-Message-State: AOAM530lOc5F5IKOG7nLDEHmmva+VlZW01Ga3DzbstnHabRR9lhBqdMg IGOa7tuWESGhemH8Wd/czeaif0zSvpIgHQUlA2/TyiCku+hxXP79gIicM9a7iv0W5w+fG9oCsvw lmzUD8F7Xsss= X-Received: by 2002:adf:eb06:: with SMTP id s6mr5893705wrn.96.1639821476625; Sat, 18 Dec 2021 01:57:56 -0800 (PST) X-Google-Smtp-Source: ABdhPJy9LbEcpllPC7Xh1e1TAUDJFAcJuppWJcec9eB7enbSgMiKNtwU6H4/Sx3BKFYdlbxSdLnFrQ== X-Received: by 2002:adf:eb06:: with SMTP id s6mr5893665wrn.96.1639821476253; Sat, 18 Dec 2021 01:57:56 -0800 (PST) Received: from [192.168.3.132] (p5b0c6703.dip0.t-ipconnect.de. [91.12.103.3]) by smtp.gmail.com with ESMTPSA id i15sm16076211wmq.18.2021.12.18.01.57.54 (version=TLS1_3 cipher=TLS_AES_128_GCM_SHA256 bits=128/128); Sat, 18 Dec 2021 01:57:55 -0800 (PST) Message-ID: <40e7e0ab-0828-b2e7-339f-35f68a228b3d@redhat.com> Date: Sat, 18 Dec 2021 10:57:54 +0100 MIME-Version: 1.0 User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.2.0 To: Linus Torvalds Cc: Linux Kernel Mailing List , Andrew Morton , Hugh Dickins , David Rientjes , Shakeel Butt , John Hubbard , Jason Gunthorpe , Mike Kravetz , Mike Rapoport , Yang Shi , "Kirill A . Shutemov" , Matthew Wilcox , Vlastimil Babka , Jann Horn , Michal Hocko , Nadav Amit , Rik van Riel , Roman Gushchin , Andrea Arcangeli , Peter Xu , Donald Dutile , Christoph Hellwig , Oleg Nesterov , Jan Kara , Linux-MM , "open list:KERNEL SELFTEST FRAMEWORK" , "open list:DOCUMENTATION" References: <20211217113049.23850-1-david@redhat.com> <20211217113049.23850-7-david@redhat.com> <9c3ba92e-9e36-75a9-9572-a08694048c1d@redhat.com> <02cf4dcf-74e8-9cbd-ffbf-8888f18a9e8a@redhat.com> From: David Hildenbrand Organization: Red Hat Subject: Re: [PATCH v1 06/11] mm: support GUP-triggered unsharing via FAULT_FLAG_UNSHARE (!hugetlb) In-Reply-To: X-Mimecast-Spam-Score: 0 X-Mimecast-Originator: redhat.com Content-Language: en-US Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 7bit X-Rspamd-Queue-Id: 1880118003D X-Stat-Signature: 8tc5cu66xooo6uu7r9f7p4njzch8ks6m Authentication-Results: imf24.hostedemail.com; dkim=pass header.d=redhat.com header.s=mimecast20190719 header.b=YCmIb0J7; spf=none (imf24.hostedemail.com: domain of david@redhat.com has no SPF policy when checking 170.10.133.124) smtp.mailfrom=david@redhat.com; dmarc=pass (policy=none) header.from=redhat.com X-Rspamd-Server: rspam10 X-HE-Tag: 1639821476-551310 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On 18.12.21 00:20, Linus Torvalds wrote: > On Fri, Dec 17, 2021 at 2:43 PM David Hildenbrand wrote: >> >> The pages stay PageAnon(). swap-backed pages simply set a bit IIRC. >> mapcount still applies. > > Our code-base is too large for me to remember all the details, but if > we still end up having PageAnon for swapbacked pages, then mapcount > can increase from another process faulting in an pte with that swap > entry. "Our code-base is too large for me to remember all the details". I second that. You might a valid point with the mapcount regarding concurrent swapin in the current code, I'll have to think further about that if it could be a problem and if it cannot be handled without heavy synchronization (I think the concern is that gup unsharing could miss doing an unshare because it doesn't detect that there are other page sharers not expressed in the mapcount code but via the swap code when seeing mapcount == 1). Do you have any other concerns regarding the semantics/stability regarding the following points (as discussed, fork() is not the issue because it can be handled via write_protect_seq or something comparable. handling per-process thingies is not the problem): a) Using PageAnon(): It cannot possibly change in the pagefault path or in the gup-fast-only path (otherwise there would be use-after-free already). b) Using PageKsm(): It cannot possibly change in the pagefault path or in the gup-fast path (otherwise there would be use-after-free already). c) Using mapcount: It cannot possibly change in the way we care about or cannot detect (mapcount going from == 1 to > 1 concurrently) in the pagefault path or in the gup-fast path due to fork(). You're point for c) is that we might currently not handle swap correctly. Any other concerns, especially regarding the mapcount or is that it? IIUC, any GUP approach to detect necessary unsharing would at least require a check for a) and b). What we're arguing about is c). > > And mmap_sem doesn't protect against that. Again, page_lock() does. > > And taking the page lock was a big performance issue. > > One of the reasons that new COW handling is so nice is that you can do > things like > > if (!trylock_page(page)) > goto copy; > > exactly because in the a/b world order, the copy case is always safe. > > In your model, as far as I can tell, you leave the page read-only and > a subsequent COW fault _can_ happen, which means that now the > subsequent COW needs to b every very careful, because if it ever > copies a page that was GUP'ed, you just broke the rules. > > So COWing too much is a bug (because it breaks the page from the GUP), > but COWing too little is an even worse problem (because it measn that > now the GUP user can see data it shouldn't have seen). Good summary, I'll extend below. > > Our old code literally COWed too little. It's why all those changes > happened in the first place. Let's see if we can agree on some things to get a common understanding. What can happen with COW is: 1) Missed COW We miss a COW, therefore someone has access to a wrong page. This is the security issue as in patch #11. The security issue documented in [1]. 2) Unnecessary COW We do a COW, but there are no other valid users, so it's just overhead + noise. The performance issue documented in section 5 in [1]. 3) Wrong COW We do a COW but there are other valid users (-> GUP). The memory corruption issue documented in section 2 and 3 in [1]. Most notably, the io_uring reproducer which races with the page_maybe_dma_pinned() check in current code can trigger this easily, and exactly this issues is what gives me nightmares. [2] Does that make sense? If we agree on the above, then here is how the currently discussed approaches differ: page_count != 1: * 1) cannot happen * 2) can happen easily (speculative references due to pagecache, migration, daemon, pagevec, ...) * 3) can happen in the current code mapcount > 1: * 1) your concern is that this can happen due to concurrent swapin * 2) cannot happen. * 3) your concern is that this can happen due to concurrent swapin If we can agree on that, I can see why you dislike mapcount, can you see why I dislike page_count? Ideally we'd really have a fast and reliable check for "is this page shared and could get used by multiple processes -- either multiple processes are already mapping it R/O or could map it via the swap R/O later". > This is why I'm pushing that whole story line of > > (1) COW is based purely on refcounting, because that's the only thing > that obviously can never COW too little. I am completely missing how 2) or 3) could *ever* be handled properly for page_count != 1. 3) is obviously more important and gives me nightmares. And that's what I'm trying to communicate the whole time: page_count is absolutely fragile, because anything that results in a page getting mapped R/O into a page table can trigger 3). And as [2] proves that can even happen with *swap*. (see how we're running into the same swap issues with both approaches? Stupid swap :) ) > > (2) GUP pre-COWs (the thing I called the "(a)" rule earlier) and then > makes sure to not mark pinned pages COW again (that "(b)" rule). > > and here "don't use page_mapcount()" really is about that (1). > > You do seem to have kept (1) in that your COW rules don't seem to > change (but maybe I missed it), but because your GUP-vs-COW semantics > are very different indeed, I'm not at all convinced about (2). Oh yes, sorry, not in the context of this series. The point is that the current page_count != 1 covers mapcount > 1, so we can adjust that separately later. You mentioned "design", so let's assume we have a nice function: /* * Check if an anon page is shared or exclusively used by a single * process: if shared, the page is shared by multiple processes either * mapping the page R/O ("active sharing") or having swap entries that * could result in the page getting mapped R/O ("inactive sharing"). * * This function is safe to be called under mmap_lock in read/write mode * because it prevents concurrent fork() sharing the page. * This function is safe to be called from gup-fast-only in IRQ context, * as it detects concurrent fork() sharing the page */ bool page_anon_shared(); Can we agree that that would that be a suitable function for (1) and (2) instead of using either the page_count or the mapcount directly? (yes, how to actually make it reliable due to swapin is to be discussed, but it might be a problem worth solving if that's the way to go) For hugetlb, this would really have to use the mapcount as explained (after all, fortunately there is no swap ...). [1] https://lore.kernel.org/all/3ae33b08-d9ef-f846-56fb-645e3b9b4c66@redhat.com/ [2] https://gitlab.com/aarcange/kernel-testcases-for-v5.11/-/blob/main/io_uring_swap.c -- Thanks, David / dhildenb