From: Suren Baghdasaryan
Date: Tue, 8 Oct 2024 10:51:57 -0700
Subject: Re: [BUG] page table UAF, Re: [PATCH v8 14/21] mm/mmap: Avoid zeroing vma tree in mmap_region()
References: <20240830040101.822209-1-Liam.Howlett@oracle.com> <20240830040101.822209-15-Liam.Howlett@oracle.com>
To: Jann Horn
Cc: "Liam R. Howlett", Andrew Morton, Lorenzo Stoakes, Linux-MM, kernel list, Matthew Wilcox, Vlastimil Babka, Sidhartha Kumar, Bert Karwatzki, Jiri Olsa, Kees Cook, "Paul E.
 McKenney", Jeff Xu, Seth Jenkins

On Tue, Oct 8, 2024 at 10:16 AM Jann Horn wrote:
>
> On Tue, Oct 8, 2024 at 3:51 AM Liam R. Howlett wrote:
> > * Jann Horn [241007 17:31]:
> > > On Mon, Oct 7, 2024 at 10:31 PM Liam R. Howlett wrote:
> > > > * Jann Horn [241007 15:06]:
> > > > > On Fri, Aug 30, 2024 at 6:00 AM Liam R. Howlett wrote:
> > > > > > Instead of zeroing the vma tree and then overwriting the area,
> > > > > > let the area be overwritten and then clean up the gathered vmas
> > > > > > using vms_complete_munmap_vmas().
> > > > > >
> > > > > > To ensure locking is downgraded correctly, the mm is set
> > > > > > regardless of MAP_FIXED or not (NULL vma).
> > > > > >
> > > > > > If a driver is mapping over an existing vma, then clear the ptes
> > > > > > before the call_mmap() invocation.  This is done using the
> > > > > > vms_clean_up_area() helper.  If there is a close vm_ops, that
> > > > > > must also be called to ensure any cleanup is done before mapping
> > > > > > over the area.  This also means that calling open has been added
> > > > > > to the abort of an unmap operation, for now.
> > > > >
> > > > > As currently implemented, this is not a valid optimization because
> > > > > it violates the (unwritten?) rule that you must not call
> > > > > free_pgd_range() on a region in the page tables which can
> > > > > concurrently be walked.  A region in the page tables can be
> > > > > concurrently walked if it overlaps a VMA which is linked into
> > > > > rmaps which are not write-locked.
> > > >
> > > > Just for clarity, this is the rmap write lock.
> > >
> > > Ah, yes.
> > > > > On Linux 6.12-rc2, when you mmap(MAP_FIXED) over an existing VMA,
> > > > > and the new mapping is created by expanding an adjacent VMA, the
> > > > > following race with an ftruncate() is possible (because page
> > > > > tables for the old mapping are removed while the new VMA in the
> > > > > same location is already fully set up and linked into the rmap):
> > > > >
> > > > > task 1 (mmap, MAP_FIXED)          task 2 (ftruncate)
> > > > > ========================          ==================
> > > > > mmap_region
> > > > >   vma_merge_new_range
> > > > >     vma_expand
> > > > >       commit_merge
> > > > >         vma_prepare
> > > > >           [take rmap locks]
> > > > >         vma_set_range
> > > > >           [expand adjacent mapping]
> > > > >         vma_complete
> > > > >           [drop rmap locks]
> > > > >   vms_complete_munmap_vmas
> > > > >     vms_clear_ptes
> > > > >       unmap_vmas
> > > > >         [removes ptes]
> > > > >     free_pgtables
> > > > >       [unlinks old vma from rmap]
> > > > >                                   unmap_mapping_range
> > > > >                                     unmap_mapping_pages
> > > > >                                       i_mmap_lock_read
> > > > >                                       unmap_mapping_range_tree
> > > > >                                         [loop]
> > > > >                                         unmap_mapping_range_vma
> > > > >                                           zap_page_range_single
> > > > >                                             unmap_single_vma
> > > > >                                               unmap_page_range
> > > > >                                                 zap_p4d_range
> > > > >                                                   zap_pud_range
> > > > >                                                     zap_pmd_range
> > > > >                                                       [looks up pmd entry]
> > > > >       free_pgd_range
> > > > >         [frees pmd]
> > > > >                                                       [UAF pmd entry access]
> > > > >
> > > > > To reproduce this, apply the attached mmap-vs-truncate-racewiden.diff
> > > > > to widen the race windows, then build and run the attached
> > > > > reproducer mmap-fixed-race.c.
> > > > >
> > > > > Under a kernel with KASAN, you should ideally get a KASAN splat
> > > > > like this:
> > > >
> > > > Thanks for all the work you did finding the root cause here, I
> > > > appreciate it.
> > >
> > > Ah, this is not a bug I ran into while testing, it's a bug I found
> > > while reading the patch.
> > > It's much easier to explain the issue and come up with a nice
> > > reproducer this way than when you start out from a crash. :P
> > >
> > > > I think the correct fix is to take the rmap lock on free_pgtables,
> > > > when necessary.  There are a few code paths (error recovery) that
> > > > are not regularly run that will also need to change.
> > >
> > > Hmm, yes, I guess that might work. Though I think there might be more
> > > races: One related aspect of this optimization that is unintuitive to
> > > me is that, directly after vma_merge_new_range(), a concurrent rmap
> > > walk could probably be walking the newly-extended VMA but still
> > > observe PTEs belonging to the previous VMA. I don't know how robust
> > > the various rmap walks are to things like encountering pfnmap PTEs in
> > > non-pfnmap VMAs, or hugetlb PUD entries in non-hugetlb VMAs. For
> > > example, page_vma_mapped_walk() looks like, if you called it on a
> > > page table range with huge PUD entries, but with a VMA without
> > > VM_HUGETLB, something might go wrong on the
> > > "pmd_offset(pud, pvmw->address)" call, and a 1G hugepage might get
> > > misinterpreted as a page table? But I haven't experimentally verified
> > > that.
> >
> > Yes, I am also concerned that reacquiring the lock will result in
> > another race.  I also don't think holding the lock for longer is a
> > good idea as it will most likely cause a regression by extending the
> > lock for the duration of the mmap() setup.  Although, maybe it would
> > be fine if we only keep it held if we are going to be removing a vma
> > in the MAP_FIXED case.
>
> I guess you could have a separate seqcount on the mm for "there might
> be temporary holes in the tree", or temporarily store some markers in
> the maple tree that say "there is nothing here right now, but if you
> want to see a full picture while iterating, you have to try again
> later"?
>
> Or you could basically unmap the VMA while it is still in the VMA tree
> but is already locked and marked as detached? So first you do
> unmap_vmas() and free_pgtables() (which clears the PTEs, removes the
> rmap links, and deletes page tables), then prepare the new VMAs, and
> then replace the old VMA's entries in the VMA tree with the new
> entries? I guess in the end the result would semantically be pretty
> similar to having markers in the maple tree.
>
> > Another idea would be to change the pte to know if a vma is being
> > modified using the per-vma locking, but I'm not sure what action to
> > take if we detect the vma is being modified to avoid the issue.  This
> > would also need to be done to all walkers (or low enough in the
> > stack).
>
> Sorry, can you clarify what pte the "change the pte" is referring to?
>
> > By the way, this isn't an optimisation; this is to fix RCU walkers of
> > the vma tree seeing a hole between the underlying implementation of
> > the MAP_FIXED operations of munmap() and mmap().  This is needed for
> > things like the /proc/{pid}/maps rcu walker.  The page tables
> > currently fall back to the old way of locking if a hole is seen (and
> > sane applications shouldn't really be page faulting something being
> > removed anyways..)
>
> Is that code in a tree somewhere?
>
> What locking will those RCU walkers use when accessing VMAs? I guess
> they probably anyway have to take the VMA locks to ensure they see
> consistent state, though I guess with enough effort you could avoid it
> (seqlock-style) on some fastpaths when the vma is not concurrently
> modified and the fastpath doesn't need access to the VMA's file?

Sorry, it's not posted upstream yet but yes, the idea is to walk the
tree under RCU and detect concurrent changes using seq-counters. A
prototype was posted here:
https://lore.kernel.org/all/20240123231014.3801041-3-surenb@google.com/
but it had some issues I'm yet to resolve.