Date: Wed, 25 Sep 2024 13:46:57 +0200
From: Simona Vetter <simona.vetter@ffwll.ch>
To: Alistair Popple
Cc: Matthew Brost, intel-xe@lists.freedesktop.org, dri-devel@lists.freedesktop.org,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org, simona.vetter@ffwll.ch,
	Philip.Yang@amd.com, akpm@linux-foundation.org, felix.kuehling@amd.com,
	christian.koenig@amd.com
Subject: Re: [PATCH 1/1] mm/migrate: Trylock device page in do_swap_page
References: <20240911030337.870160-1-matthew.brost@intel.com>
 <20240911030337.870160-2-matthew.brost@intel.com>
 <87mskehjtc.fsf@nvdebian.thelocal>
In-Reply-To: <87mskehjtc.fsf@nvdebian.thelocal>
On Wed, Sep 11, 2024 at 02:53:31PM +1000, Alistair Popple wrote:
> 
> Matthew Brost writes:
> 
> > Avoid multiple CPU page faults to the same device page racing by locking
> > the page in do_swap_page before taking an additional reference to the
> > page. This prevents scenarios where multiple CPU page faults each take
> > an extra reference to a device page, which could abort migration in
> > folio_migrate_mapping. With the device page locked in do_swap_page, the
> > migrate_vma_* functions need to be updated to avoid locking the
> > fault_page argument.
> 
> I added the get_page() and therefore the fault_page argument to deal
> with another lifetime issue. Neither Felix nor I particularly liked the
> solution though (see
> https://lore.kernel.org/all/cover.60659b549d8509ddecafad4f498ee7f03bb23c69.1664366292.git-series.apopple@nvidia.com/T/#m715589d57716448386ff9af41da27a952d94615a)
> and this seems to make it even uglier, so I have suggested an
> alternative solution below.

Just an aside, I don't find the fault_page special handling all that
offensive, because fundamentally fault_page _is_ special: it's the one
page we _really_ have to migrate, or userspace blows up because the
kernel sucks. But maybe we should stuff it into the vmf struct instead
and pass that around? I think that would be conceptually a notch
cleaner.
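
Roughly what I have in mind, as a completely untested sketch (the vmf
member and the little helper below are made up purely to illustrate the
idea, they are not existing API):

	/* Hypothetical sketch only, not a real interface. */
	struct migrate_vma {
		...
		/*
		 * Set when the migration was kicked off from the CPU fault
		 * path; the page that absolutely must migrate is then
		 * vmf->page and is already locked by the core mm.
		 */
		const struct vm_fault	*vmf;
	};

	static struct page *migrate_vma_fault_page(const struct migrate_vma *migrate)
	{
		return migrate->vmf ? migrate->vmf->page : NULL;
	}

That way all the special-casing keys off the fault context instead of
yet another ad-hoc pointer threaded through the migrate_vma_* calls.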
-Sima

> > Prior to this change, a livelock scenario could occur in Xe's (Intel GPU
> > DRM driver) SVM implementation if enough threads faulted the same device
> > page.
> 
> I haven't seen the same in the NVIDIA UVM driver (out-of-tree, I know)
> but theoretically it seems like it should be possible. However we
> serialize migrations of the same virtual address range to avoid these
> kind of issues as they can happen the other way too (ie. multiple
> threads trying to migrate to GPU).
> 
> So I suspect what happens in UVM is that one thread wins and installs
> the migration entry while the others fail to get the driver migration
> lock and bail out sufficiently early in the fault path to avoid the
> live-lock.
> 
> > Cc: Philip Yang
> > Cc: Felix Kuehling
> > Cc: Christian König
> > Cc: Andrew Morton
> > Suggessted-by: Simona Vetter
> > Signed-off-by: Matthew Brost
> > ---
> >  mm/memory.c         | 13 +++++++---
> >  mm/migrate_device.c | 60 +++++++++++++++++++++++++++++++--------------
> >  2 files changed, 50 insertions(+), 23 deletions(-)
> > 
> > diff --git a/mm/memory.c b/mm/memory.c
> > index 3c01d68065be..bbd97d16a96a 100644
> > --- a/mm/memory.c
> > +++ b/mm/memory.c
> > @@ -4046,10 +4046,15 @@ vm_fault_t do_swap_page(struct vm_fault *vmf)
> >  			 * Get a page reference while we know the page can't be
> >  			 * freed.
> >  			 */
> > -			get_page(vmf->page);
> > -			pte_unmap_unlock(vmf->pte, vmf->ptl);
> > -			ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > -			put_page(vmf->page);
> > +			if (trylock_page(vmf->page)) {
> > +				get_page(vmf->page);
> > +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> 
> This is all beginning to look a lot like migrate_vma_collect_pmd(). So
> rather than do this and then have to pass all this context
> (ie. fault_page) down to the migrate_vma_* functions could we instead
> just do what migrate_vma_collect_pmd() does here? Ie. we already have
> the PTL and the page lock so there's no reason we couldn't just setup
> the migration entry prior to calling migrate_to_ram().
> 
> Obviously calling migrate_vma_setup() would show the page as not
> migrating, but drivers could easily just fill in the src_pfn info after
> calling migrate_vma_setup().
> 
> This would eliminate the whole fault_page ugliness.
> 
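
For illustration, inside the trylock_page() branch above that could look
very roughly like this (completely untested sketch; reference counting,
anon-exclusive handling, soft-dirty/uffd-wp bits and TLB flushing all
glossed over):

	/*
	 * Sketch only: with vmf->ptl held and the page locked, install a
	 * migration entry before calling migrate_to_ram(), roughly what
	 * migrate_vma_collect_pmd() does today.
	 */
	swp_entry_t mig_entry;

	if (is_writable_device_private_entry(entry))
		mig_entry = make_writable_migration_entry(page_to_pfn(vmf->page));
	else
		mig_entry = make_readable_migration_entry(page_to_pfn(vmf->page));

	get_page(vmf->page);
	set_pte_at(vmf->vma->vm_mm, vmf->address, vmf->pte,
		   swp_entry_to_pte(mig_entry));
	pte_unmap_unlock(vmf->pte, vmf->ptl);
	ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);

The driver would then fill in the corresponding src_pfn after
migrate_vma_setup(), as described above.
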
> > +				ret = vmf->page->pgmap->ops->migrate_to_ram(vmf);
> > +				put_page(vmf->page);
> > +				unlock_page(vmf->page);
> > +			} else {
> > +				pte_unmap_unlock(vmf->pte, vmf->ptl);
> > +			}
> >  		} else if (is_hwpoison_entry(entry)) {
> >  			ret = VM_FAULT_HWPOISON;
> >  		} else if (is_pte_marker_entry(entry)) {
> > diff --git a/mm/migrate_device.c b/mm/migrate_device.c
> > index 6d66dc1c6ffa..049893a5a179 100644
> > --- a/mm/migrate_device.c
> > +++ b/mm/migrate_device.c
> > @@ -60,6 +60,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  				   struct mm_walk *walk)
> >  {
> >  	struct migrate_vma *migrate = walk->private;
> > +	struct folio *fault_folio = migrate->fault_page ?
> > +		page_folio(migrate->fault_page) : NULL;
> >  	struct vm_area_struct *vma = walk->vma;
> >  	struct mm_struct *mm = vma->vm_mm;
> >  	unsigned long addr = start, unmapped = 0;
> > @@ -88,11 +90,13 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  
> >  		folio_get(folio);
> >  		spin_unlock(ptl);
> > -		if (unlikely(!folio_trylock(folio)))
> > +		if (unlikely(fault_folio != folio &&
> > +			     !folio_trylock(folio)))
> >  			return migrate_vma_collect_skip(start, end,
> >  							walk);
> >  		ret = split_folio(folio);
> > -		folio_unlock(folio);
> > +		if (fault_folio != folio)
> > +			folio_unlock(folio);
> >  		folio_put(folio);
> >  		if (ret)
> >  			return migrate_vma_collect_skip(start, end,
> > @@ -192,7 +196,7 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  		 * optimisation to avoid walking the rmap later with
> >  		 * try_to_migrate().
> >  		 */
> > -		if (folio_trylock(folio)) {
> > +		if (fault_folio == folio || folio_trylock(folio)) {
> >  			bool anon_exclusive;
> >  			pte_t swp_pte;
> >  
> > @@ -204,7 +208,8 @@ static int migrate_vma_collect_pmd(pmd_t *pmdp,
> >  
> >  			if (folio_try_share_anon_rmap_pte(folio, page)) {
> >  				set_pte_at(mm, addr, ptep, pte);
> > -				folio_unlock(folio);
> > +				if (fault_folio != folio)
> > +					folio_unlock(folio);
> >  				folio_put(folio);
> >  				mpfn = 0;
> >  				goto next;
> > @@ -363,6 +368,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >  					  unsigned long npages,
> >  					  struct page *fault_page)
> >  {
> > +	struct folio *fault_folio = fault_page ?
> > +		page_folio(fault_page) : NULL;
> >  	unsigned long i, restore = 0;
> >  	bool allow_drain = true;
> >  	unsigned long unmapped = 0;
> > @@ -427,7 +434,8 @@ static unsigned long migrate_device_unmap(unsigned long *src_pfns,
> >  		remove_migration_ptes(folio, folio, false);
> >  
> >  		src_pfns[i] = 0;
> > -		folio_unlock(folio);
> > +		if (fault_folio != folio)
> > +			folio_unlock(folio);
> >  		folio_put(folio);
> >  		restore--;
> >  	}
> > @@ -536,6 +544,8 @@ int migrate_vma_setup(struct migrate_vma *args)
> >  		return -EINVAL;
> >  	if (args->fault_page && !is_device_private_page(args->fault_page))
> >  		return -EINVAL;
> > +	if (args->fault_page && !PageLocked(args->fault_page))
> > +		return -EINVAL;
> >  
> >  	memset(args->src, 0, sizeof(*args->src) * nr_pages);
> >  	args->cpages = 0;
> > @@ -799,19 +809,13 @@ void migrate_vma_pages(struct migrate_vma *migrate)
> >  }
> >  EXPORT_SYMBOL(migrate_vma_pages);
> >  
> > -/*
> > - * migrate_device_finalize() - complete page migration
> > - * @src_pfns: src_pfns returned from migrate_device_range()
> > - * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > - * @npages: number of pages in the range
> > - *
> > - * Completes migration of the page by removing special migration entries.
> > - * Drivers must ensure copying of page data is complete and visible to the CPU
> > - * before calling this.
> > - */
> > -void migrate_device_finalize(unsigned long *src_pfns,
> > -			unsigned long *dst_pfns, unsigned long npages)
> > +static void __migrate_device_finalize(unsigned long *src_pfns,
> > +				      unsigned long *dst_pfns,
> > +				      unsigned long npages,
> > +				      struct page *fault_page)
> >  {
> > +	struct folio *fault_folio = fault_page ?
> > +		page_folio(fault_page) : NULL;
> >  	unsigned long i;
> >  
> >  	for (i = 0; i < npages; i++) {
> > @@ -838,7 +842,8 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >  		src = page_folio(page);
> >  		dst = page_folio(newpage);
> >  		remove_migration_ptes(src, dst, false);
> > -		folio_unlock(src);
> > +		if (fault_folio != src)
> > +			folio_unlock(src);
> >  
> >  		if (is_zone_device_page(page))
> >  			put_page(page);
> > @@ -854,6 +859,22 @@ void migrate_device_finalize(unsigned long *src_pfns,
> >  		}
> >  	}
> >  }
> > +
> > +/*
> > + * migrate_device_finalize() - complete page migration
> > + * @src_pfns: src_pfns returned from migrate_device_range()
> > + * @dst_pfns: array of pfns allocated by the driver to migrate memory to
> > + * @npages: number of pages in the range
> > + *
> > + * Completes migration of the page by removing special migration entries.
> > + * Drivers must ensure copying of page data is complete and visible to the CPU
> > + * before calling this.
> > + */
> > +void migrate_device_finalize(unsigned long *src_pfns,
> > +			unsigned long *dst_pfns, unsigned long npages)
> > +{
> > +	return __migrate_device_finalize(src_pfns, dst_pfns, npages, NULL);
> > +}
> >  EXPORT_SYMBOL(migrate_device_finalize);
> >  
> >  /**
> > @@ -869,7 +890,8 @@ EXPORT_SYMBOL(migrate_device_finalize);
> >   */
> >  void migrate_vma_finalize(struct migrate_vma *migrate)
> >  {
> > -	migrate_device_finalize(migrate->src, migrate->dst, migrate->npages);
> > +	__migrate_device_finalize(migrate->src, migrate->dst, migrate->npages,
> > +				  migrate->fault_page);
> >  }
> >  EXPORT_SYMBOL(migrate_vma_finalize);
> 
-- 
Simona Vetter
Software Engineer, Intel Corporation
http://blog.ffwll.ch