Message-ID: <329e01f6-813a-9212-97c9-9440894dcf2c@redhat.com>
Date: Tue, 1 Mar 2022 09:24:41 +0100
To: linux-kernel@vger.kernel.org
Cc: Andrew Morton, Hugh Dickins, Linus Torvalds, David Rientjes, Shakeel Butt, John Hubbard, Jason Gunthorpe, Mike Kravetz, Mike Rapoport, Yang Shi, "Kirill A . Shutemov", Matthew Wilcox, Vlastimil Babka, Jann Horn, Michal Hocko, Nadav Amit, Rik van Riel, Roman Gushchin, Andrea Arcangeli, Peter Xu, Donald Dutile, Christoph Hellwig, Oleg Nesterov, Jan Kara, Liang Zhang, Pedro Gomes, Oded Gabbay, linux-mm@kvack.org, Khalid Aziz
References: <20220224122614.94921-1-david@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Subject: Re: [PATCH RFC 00/13] mm: COW fixes part 2: reliable GUP pins of anonymous pages
In-Reply-To: <20220224122614.94921-1-david@redhat.com>

On 24.02.22 13:26, David Hildenbrand
wrote:
> This series is the result of the discussion on the previous approach [2].
> More information on the general COW issues can be found there. It is based
> on [1], which resides in -mm and -next:
>   [PATCH v3 0/9] mm: COW fixes part 1: fix the COW security issue for
>   THP and swap
>
> I keep the latest state, including some hacky selftest on:
>   https://github.com/davidhildenbrand/linux/tree/cow_fixes_part_2
>
> This series fixes memory corruptions when a GUP pin (FOLL_PIN) was taken
> on an anonymous page and COW logic fails to detect exclusivity of the page
> and then replaces the anonymous page with a copy in the page table: The
> GUP pin loses synchronicity with the pages mapped into the page tables.
>
> This issue, including other related COW issues, has been summarized in [3]
> under 3):
> "
> 3. Intra Process Memory Corruptions due to Wrong COW (FOLL_PIN)
>
> page_maybe_dma_pinned() is used to check if a page may be pinned for
> DMA (using FOLL_PIN instead of FOLL_GET). While false positives are
> tolerable, false negatives are problematic: pages that are pinned for
> DMA must not be added to the swapcache. If it happens, the (now pinned)
> page could be faulted back from the swapcache into page tables
> read-only. Future write-access would detect the pinning and COW the
> page, losing synchronicity. For the interested reader, this is nicely
> documented in feb889fb40fa ("mm: don't put pinned pages into the swap
> cache").
>
> Peter reports [8] that page_maybe_dma_pinned() as used is racy in some
> cases and can result in a violation of the documented semantics:
> giving false negatives because of the race.
>
> There are cases where we call it without properly taking a per-process
> sequence lock, turning the usage of page_maybe_dma_pinned() racy.
> While one case (clear_refs SOFTDIRTY tracking, see below) seems to be easy
> to handle, there is especially one rmap case (shrink_page_list) that's hard
> to fix: in the rmap world, we're not limited to a single process.
>
> The shrink_page_list() issue is really subtle. If we race with
> someone pinning a page, we can trigger the same issue as in the FOLL_GET
> case. See the detail section at the end of this mail for a discussion of
> how badly this can bite us with VFIO or other FOLL_PIN users.
>
> It's harder to reproduce, but I managed to modify the O_DIRECT
> reproducer to use io_uring fixed buffers [15] instead, which ends up
> using FOLL_PIN | FOLL_WRITE | FOLL_LONGTERM to pin buffer pages and can
> similarly trigger a loss of synchronicity and consequently a memory
> corruption.
>
> Again, the root issue is that a write-fault on a page that has
> additional references results in a COW and thereby a loss of
> synchronicity and consequently a memory corruption if two parties
> believe they are referencing the same page.
> "
>
> This series makes GUP pins (R/O and R/W) on anonymous pages fully reliable,
> including with concurrent pinning via GUP-fast; for example, it also fully
> fixes an issue recently reported regarding NUMA balancing [4]. While doing
> that, it further reduces "unnecessary COWs", especially when we don't
> fork()/KSM and don't swapout, and fixes the COW security for hugetlb for
> FOLL_PIN.
>
> In summary, we track via a pageflag (PG_anon_exclusive) whether a mapped
> anonymous page is exclusive. Exclusive anonymous pages that are mapped
> R/O can directly be mapped R/W by the COW logic in the write fault handler.
> Exclusive anonymous pages that want to be shared (fork(), KSM) first have
> to mark a mapped anonymous page shared -- which will fail if there are
> GUP pins on the page. GUP is only allowed to take a pin on anonymous pages
> that are exclusive.
> The PT lock is the primary mechanism to synchronize
> modifications of PG_anon_exclusive. GUP-fast is synchronized either via the
> src_mm->write_protect_seq or via clear/invalidate+flush of the relevant
> page table entry.
>
> Special care has to be taken about swap, migration, and THPs (whereby a
> PMD-mapping can be converted to a PTE mapping and we have to track
> information for subpages). Besides these, we let the rmap code handle most
> magic. For reliable R/O pins of anonymous pages, we need FAULT_FLAG_UNSHARE
> logic as in our previous approach [2]; however, it's now 100% mapcount
> free and I further simplified it a bit.
>
> #1 is a fix
> #3-#7 are mostly rmap preparations for PG_anon_exclusive handling
> #8 introduces PG_anon_exclusive
> #9 uses PG_anon_exclusive and makes R/W pins of anonymous pages
>    reliable
> #10 is a preparation for reliable R/O pins
> #11 and #12 are reused/modified GUP-triggered unsharing for R/O GUP pins,
>    making R/O pins of anonymous pages reliable
> #13 adds sanity checks when (un)pinning anonymous pages
>
> I'm not proud of patch #8, suggestions welcome. Patch #9 contains
> extensive explanations and the main logic for R/W pins. #11 and #12
> resemble what we proposed in the previous approach [2]. I consider the
> general approach of #13 very nice and helpful, and I remember Linus even
> envisioning something like that for finding BUGs, although we might want to
> implement the sanity checks eventually differently.
>
> It passes my growing set of tests for "wrong COW" and "missed COW",
> including the ones in [3] -- I'd really appreciate some experienced eyes
> to take a close look at corner cases. Only tested on x86_64, testing with
> CONT-mapped hugetlb pages on arm64 might be interesting.
>
> Once we have converted relevant users of FOLL_GET (e.g., O_DIRECT) to
> FOLL_PIN, the issue described in [3] under 2) will be fixed as well.
> Further, once
> that's in place we can streamline our COW logic for hugetlb to rely on
> page_count() as well and fix any possible COW security issues.

Hi,

I did extensive tests on aarch64 with CONT hugetlb pages yesterday and
didn't find any surprises, at least not in these changes. [1]

While thinking about how to get O_DIRECT fixed immediately, I realized the
following:

(1) This series makes FOLL_PIN on anonymous pages completely reliable,
    for any case of GUP+fork (GUP before fork, GUP during fork, GUP
    after fork in child/parent), which is pretty nice IMHO.

(2) For O_DIRECT and friends (FOLL_GET) we primarily care about fixing
    short-term FOLL_WRITE *without* fork(), which is where we are seeing
    the memory corruptions. Long-term FOLL_GET (meaning, even staying
    reliable after fork()) already has a bad smell to it.

Especially, (2) never worked reliably with fork() involved.

I came to the conclusion that this series pretty much fixes (2) already
*except* the swapout case. I might be wrong, but I think it should be
possible to handle that as well, meaning that: a FOLL_GET|FOLL_WRITE on
an anonymous page will be reliable as long as we don't fork().

I'm planning on sending a part3 to cover that, so we don't have to wait
for the FOLL_PIN conversion to get our O_DIRECT reproducers fixed. Stay
tuned.

If there isn't any more feedback, I'll be sending out a v1 soonish.

[1] https://lkml.kernel.org/r/811c5c8e-b3a2-85d2-049c-717f17c3a03a@redhat.com

-- 
Thanks,

David / dhildenb
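
As a reading aid, the PG_anon_exclusive rules summarized in the quoted
cover letter can be sketched as a toy userspace model. This is not kernel
code and not taken from the series: struct toy_page and all toy_*()
helpers are hypothetical names made up for illustration, and locking
(PT lock, write_protect_seq), swap, migration and THP handling are left
out entirely.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical toy model of one mapped anonymous page (not kernel code). */
struct toy_page {
	bool anon_exclusive;	/* stands in for PG_anon_exclusive */
	int pins;		/* stands in for outstanding FOLL_PIN pins */
};

/* GUP is only allowed to pin an anonymous page that is exclusive. */
static bool toy_try_pin(struct toy_page *p)
{
	if (!p->anon_exclusive)
		return false;	/* caller would have to unshare first */
	p->pins++;
	return true;
}

/* Drop one pin; unpinning a page without pins is a bug in this model. */
static void toy_unpin(struct toy_page *p)
{
	assert(p->pins > 0);
	p->pins--;
}

/*
 * fork()/KSM first have to mark the page shared, which fails while
 * there are pins on the page.
 */
static bool toy_try_share(struct toy_page *p)
{
	if (p->pins > 0)
		return false;	/* sharing refused, page stays exclusive */
	p->anon_exclusive = false;
	return true;
}

/*
 * Write fault on a R/O mapping: an exclusive page can be mapped R/W in
 * place (returns true); a shared page needs a copy (returns false).
 */
static bool toy_write_fault_reuses(const struct toy_page *p)
{
	return p->anon_exclusive;
}
```

In this model a pinned page can never become shared, and a page that is
not exclusive can never be pinned, so a write fault on a pinned page
always reuses the page instead of replacing it with a copy. That is the
invariant the quoted summary describes; how the series actually enforces
it is in the patches themselves.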