Message-ID: <2406521e-f5be-474e-b653-e5ad38a1d7de@redhat.com>
Date: Fri, 1 Aug 2025 15:20:25 +0300
From: Mika Penttilä <mpenttil@redhat.com>
Subject: Re: [v2 02/11] mm/thp: zone_device awareness in THP handling code
To: Zi Yan, David Hildenbrand
Cc: Balbir Singh, linux-mm@kvack.org, linux-kernel@vger.kernel.org,
 Karol Herbst, Lyude Paul, Danilo Krummrich, David Airlie, Simona Vetter,
 Jérôme Glisse, Shuah Khan, Barry Song, Baolin Wang, Ryan Roberts,
 Matthew Wilcox, Peter Xu, Kefeng Wang, Jane Chu, Alistair Popple,
 Donet Tom, Matthew Brost, Francois Dugast, Ralph Campbell
In-Reply-To: <47BC6D8B-7A78-4F2F-9D16-07D6C88C3661@nvidia.com>
References: <20250730092139.3890844-1-balbirs@nvidia.com>
 <9f836828-4f53-41a0-b5f7-bbcd2084086e@redhat.com>
 <884b9246-de7c-4536-821f-1bf35efe31c8@redhat.com>
 <6291D401-1A45-4203-B552-79FE26E151E4@nvidia.com>
 <8E2CE1DF-4C37-4690-B968-AEA180FF44A1@nvidia.com>
 <2308291f-3afc-44b4-bfc9-c6cf0cdd6295@redhat.com>
 <9FBDBFB9-8B27-459C-8047-055F90607D60@nvidia.com>
 <11ee9c5e-3e74-4858-bf8d-94daf1530314@redhat.com>
 <14aeaecc-c394-41bf-ae30-24537eb299d9@nvidia.com>
 <71c736e9-eb77-4e8e-bd6a-965a1bbcbaa8@nvidia.com>
 <47BC6D8B-7A78-4F2F-9D16-07D6C88C3661@nvidia.com>

On 8/1/25 14:10, Zi Yan wrote:
> On 1 Aug 2025, at 4:46, David Hildenbrand wrote:
>
>> On 01.08.25 10:01, Balbir Singh wrote:
>>> On 8/1/25 17:04, David Hildenbrand wrote:
>>>> On 01.08.25 06:44, Balbir Singh wrote:
>>>>> On 8/1/25 11:16, Mika Penttilä wrote:
>>>>>> Hi,
>>>>>>
>>>>>> On 8/1/25 03:49, Balbir Singh wrote:
>>>>>>
>>>>>>> On 7/31/25 21:26, Zi Yan wrote:
>>>>>>>> On 31 Jul 2025, at 3:15, David Hildenbrand wrote:
>>>>>>>>
>>>>>>>>> On 30.07.25 18:29, Mika Penttilä wrote:
>>>>>>>>>> On 7/30/25 18:58, Zi Yan wrote:
>>>>>>>>>>> On 30 Jul 2025, at 11:40, Mika Penttilä wrote:
>>>>>>>>>>>
>>>>>>>>>>>> On 7/30/25 18:10, Zi Yan wrote:
>>>>>>>>>>>>> On 30 Jul 2025, at 8:49, Mika Penttilä wrote:
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On 7/30/25 15:25, Zi Yan wrote:
>>>>>>>>>>>>>>> On 30 Jul 2025, at 8:08, Mika Penttilä wrote:
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>> On 7/30/25 14:42, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>> On 7/30/25 14:30, Zi Yan wrote:
>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:27, Zi Yan wrote:
>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>> On 30 Jul 2025, at 7:16, Mika Penttilä wrote:
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> Hi,
>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>> On 7/30/25 12:21, Balbir Singh wrote:
>>>>>>>>>>>>>>>>>>>>> Make the THP handling code in the mm subsystem aware of zone
>>>>>>>>>>>>>>>>>>>>> device pages. Although the code is designed to be generic when
>>>>>>>>>>>>>>>>>>>>> it comes to handling splitting of pages, it is designed to work
>>>>>>>>>>>>>>>>>>>>> for THP page sizes corresponding to HPAGE_PMD_NR.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Modify page_vma_mapped_walk() to return true when a zone device
>>>>>>>>>>>>>>>>>>>>> huge entry is present, enabling try_to_migrate() and other
>>>>>>>>>>>>>>>>>>>>> migration paths to appropriately process the entry.
>>>>>>>>>>>>>>>>>>>>> page_vma_mapped_walk() will return true for zone device private
>>>>>>>>>>>>>>>>>>>>> large folios only when PVMW_THP_DEVICE_PRIVATE is passed. This
>>>>>>>>>>>>>>>>>>>>> is to prevent locations that are not zone device private pages
>>>>>>>>>>>>>>>>>>>>> from having to add awareness. The key callback that needs this
>>>>>>>>>>>>>>>>>>>>> flag is try_to_migrate_one(). The other callbacks, page idle and
>>>>>>>>>>>>>>>>>>>>> damon, use it for setting young/dirty bits, which is not
>>>>>>>>>>>>>>>>>>>>> significant when it comes to pmd level bit harvesting.
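[A minimal sketch of the opt-in described above, assuming the
PVMW_THP_DEVICE_PRIVATE flag behaves as the patch states;
DEFINE_FOLIO_VMA_WALK() is the existing helper from <linux/rmap.h>, and
the loop body is a placeholder, not the patch's actual code:]

        /* Sketch only: an rmap walker opting in to huge device private entries. */
        DEFINE_FOLIO_VMA_WALK(pvmw, folio, vma, address,
                              PVMW_SYNC | PVMW_THP_DEVICE_PRIVATE);

        while (page_vma_mapped_walk(&pvmw)) {
                if (pvmw.pmd && !pvmw.pte) {
                        /* PMD-level zone device private entry found */
                }
        }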
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> pmd_pfn() does not work well with zone device entries; use
>>>>>>>>>>>>>>>>>>>>> pfn_pmd_entry_to_swap() for checking and comparison of zone
>>>>>>>>>>>>>>>>>>>>> device entries.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Zone device private entries, when split via munmap, go through
>>>>>>>>>>>>>>>>>>>>> a pmd split but also need to go through a folio split; deferred
>>>>>>>>>>>>>>>>>>>>> split does not work if a fault is encountered, because fault
>>>>>>>>>>>>>>>>>>>>> handling involves migration entries (via folio_migrate_mapping)
>>>>>>>>>>>>>>>>>>>>> and the folio sizes are expected to be the same there. This
>>>>>>>>>>>>>>>>>>>>> introduces the need to split the folio while handling the pmd
>>>>>>>>>>>>>>>>>>>>> split. Because the folio is still mapped, calling folio_split()
>>>>>>>>>>>>>>>>>>>>> would cause lock recursion, so the __split_unmapped_folio()
>>>>>>>>>>>>>>>>>>>>> code is used, wrapped in a new helper,
>>>>>>>>>>>>>>>>>>>>> split_device_private_folio(), which skips the checks around
>>>>>>>>>>>>>>>>>>>>> folio->mapping and the swapcache, and the need to go through
>>>>>>>>>>>>>>>>>>>>> unmap and remap of the folio.
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Cc: Karol Herbst
>>>>>>>>>>>>>>>>>>>>> Cc: Lyude Paul
>>>>>>>>>>>>>>>>>>>>> Cc: Danilo Krummrich
>>>>>>>>>>>>>>>>>>>>> Cc: David Airlie
>>>>>>>>>>>>>>>>>>>>> Cc: Simona Vetter
>>>>>>>>>>>>>>>>>>>>> Cc: "Jérôme Glisse"
>>>>>>>>>>>>>>>>>>>>> Cc: Shuah Khan
>>>>>>>>>>>>>>>>>>>>> Cc: David Hildenbrand
>>>>>>>>>>>>>>>>>>>>> Cc: Barry Song
>>>>>>>>>>>>>>>>>>>>> Cc: Baolin Wang
>>>>>>>>>>>>>>>>>>>>> Cc: Ryan Roberts
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Wilcox
>>>>>>>>>>>>>>>>>>>>> Cc: Peter Xu
>>>>>>>>>>>>>>>>>>>>> Cc: Zi Yan
>>>>>>>>>>>>>>>>>>>>> Cc: Kefeng Wang
>>>>>>>>>>>>>>>>>>>>> Cc: Jane Chu
>>>>>>>>>>>>>>>>>>>>> Cc: Alistair Popple
>>>>>>>>>>>>>>>>>>>>> Cc: Donet Tom
>>>>>>>>>>>>>>>>>>>>> Cc: Mika Penttilä
>>>>>>>>>>>>>>>>>>>>> Cc: Matthew Brost
>>>>>>>>>>>>>>>>>>>>> Cc: Francois Dugast
>>>>>>>>>>>>>>>>>>>>> Cc: Ralph Campbell
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Matthew Brost
>>>>>>>>>>>>>>>>>>>>> Signed-off-by: Balbir Singh
>>>>>>>>>>>>>>>>>>>>> ---
>>>>>>>>>>>>>>>>>>>>>  include/linux/huge_mm.h |   1 +
>>>>>>>>>>>>>>>>>>>>>  include/linux/rmap.h    |   2 +
>>>>>>>>>>>>>>>>>>>>>  include/linux/swapops.h |  17 +++
>>>>>>>>>>>>>>>>>>>>>  mm/huge_memory.c        | 268 +++++++++++++++++++++++++++++++++-------
>>>>>>>>>>>>>>>>>>>>>  mm/page_vma_mapped.c    |  13 +-
>>>>>>>>>>>>>>>>>>>>>  mm/pgtable-generic.c    |   6 +
>>>>>>>>>>>>>>>>>>>>>  mm/rmap.c               |  22 +++-
>>>>>>>>>>>>>>>>>>>>>  7 files changed, 278 insertions(+), 51 deletions(-)
>>>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>>>>>>>> +/**
>>>>>>>>>>>>>>>>>>>>> + * split_huge_device_private_folio - split a huge device private folio into
>>>>>>>>>>>>>>>>>>>>> + * smaller pages (of order 0), currently used by migrate_device logic to
>>>>>>>>>>>>>>>>>>>>> + * split folios for pages that are partially mapped
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * @folio: the folio to split
>>>>>>>>>>>>>>>>>>>>> + *
>>>>>>>>>>>>>>>>>>>>> + * The caller has to hold the folio_lock and a reference via folio_get
>>>>>>>>>>>>>>>>>>>>> + */
>>>>>>>>>>>>>>>>>>>>> +int split_device_private_folio(struct folio *folio)
>>>>>>>>>>>>>>>>>>>>> +{
>>>>>>>>>>>>>>>>>>>>> +	struct folio *end_folio = folio_next(folio);
>>>>>>>>>>>>>>>>>>>>> +	struct folio *new_folio;
>>>>>>>>>>>>>>>>>>>>> +	int ret = 0;
>>>>>>>>>>>>>>>>>>>>> +
>>>>>>>>>>>>>>>>>>>>> +	/*
>>>>>>>>>>>>>>>>>>>>> +	 * Split the folio now. In the case of device
>>>>>>>>>>>>>>>>>>>>> +	 * private pages, this path is executed when
>>>>>>>>>>>>>>>>>>>>> +	 * the pmd is split and since freeze is not true
>>>>>>>>>>>>>>>>>>>>> +	 * it is likely the folio will be deferred_split.
>>>>>>>>>>>>>>>>>>>>> +	 *
>>>>>>>>>>>>>>>>>>>>> +	 * With device private pages, deferred splits of
>>>>>>>>>>>>>>>>>>>>> +	 * folios should be handled here to prevent partial
>>>>>>>>>>>>>>>>>>>>> +	 * unmaps from causing issues later on in migration
>>>>>>>>>>>>>>>>>>>>> +	 * and fault handling flows.
>>>>>>>>>>>>>>>>>>>>> +	 */
>>>>>>>>>>>>>>>>>>>>> +	folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>>>>>>>>>>>>>>> Why can't this freeze fail? The folio is still mapped afaics; why
>>>>>>>>>>>>>>>>>>>> can't there be other references in addition to the caller's?
>>>>>>>>>>>>>>>>>>> Based on my off-list conversation with Balbir, the folio is unmapped
>>>>>>>>>>>>>>>>>>> on the CPU side but mapped in the device. folio_ref_freeze() is not
>>>>>>>>>>>>>>>>>>> aware of the device side mapping.
>>>>>>>>>>>>>>>>>> Maybe we should make it aware of device private mapping? So that the
>>>>>>>>>>>>>>>>>> process mirrors the CPU side folio split: 1) unmap device private
>>>>>>>>>>>>>>>>>> mapping, 2) freeze device private folio, 3) split unmapped folio,
>>>>>>>>>>>>>>>>>> 4) unfreeze, 5) remap device private mapping.
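[A sketch of that mirrored flow, for illustration: device_private_unmap(),
device_private_remap() and split_unmapped() below are hypothetical
placeholders; only folio_ref_freeze()/folio_ref_unfreeze() and
folio_expected_ref_count() are existing APIs:]

        /* Hypothetical sketch of the five steps above, not actual kernel code. */
        static int device_private_folio_split(struct folio *folio)
        {
                device_private_unmap(folio);            /* 1) unmap on device */

                /* 2) freeze; note that this can fail and must be checked */
                if (!folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio))) {
                        device_private_remap(folio);
                        return -EBUSY;
                }

                split_unmapped(folio);                  /* 3) split unmapped folio */
                folio_ref_unfreeze(folio, 1);           /* 4) unfreeze */
                device_private_remap(folio);            /* 5) remap on device */
                return 0;
        }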
>>>>>>>>>>>>>>>>> Ah ok, this was obviously about a device private page here,
>>>>>>>>>>>>>>>>> nevermind..
>>>>>>>>>>>>>>>> Still, isn't this reachable from split_huge_pmd() paths, with the
>>>>>>>>>>>>>>>> folio mapped into CPU page tables as a huge device page by one or
>>>>>>>>>>>>>>>> more tasks?
>>>>>>>>>>>>>>> The folio only has migration entries pointing to it. From the CPU
>>>>>>>>>>>>>>> perspective, it is not mapped. The unmap_folio() used by
>>>>>>>>>>>>>>> __folio_split() unmaps a to-be-split folio by replacing existing
>>>>>>>>>>>>>>> page table entries with migration entries, and after that the folio
>>>>>>>>>>>>>>> is regarded as "unmapped".
>>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> The migration entry is an invalid CPU page table entry, so it is
>>>>>>>>>>>>>>> not a CPU mapping.
>>>>>>>>>>>>>> split_device_private_folio() is called for a device private entry,
>>>>>>>>>>>>>> not a migration entry afaics.
>>>>>>>>>>>>> Yes, but from the CPU perspective, both device private entries and
>>>>>>>>>>>>> migration entries are invalid CPU page table entries, so the device
>>>>>>>>>>>>> private folio is "unmapped" on the CPU side.
>>>>>>>>>>>> Yes, both are "swap entries", but there's a difference: the device
>>>>>>>>>>>> private ones contribute to mapcount and refcount.
>>>>>>>>>>> Right. That confused me when I was talking to Balbir and looking at
>>>>>>>>>>> v1. When a device private folio is processed in __folio_split(),
>>>>>>>>>>> Balbir needed to add code to skip the CPU mapping handling code.
>>>>>>>>>>> Basically, device private folios are CPU unmapped and device mapped.
>>>>>>>>>>>
>>>>>>>>>>> Here are my questions on device private folios:
>>>>>>>>>>> 1. How is mapcount used for device private folios? Why is it needed
>>>>>>>>>>>    from the CPU perspective? Can it be stored in a device private
>>>>>>>>>>>    specific data structure?
>>>>>>>>>> Mostly like for normal folios, for instance rmap when doing migrate.
>>>>>>>>>> I think it would make common code more messy if not done that way,
>>>>>>>>>> but sure, possible. And not consuming pfns (address space) at all
>>>>>>>>>> would have benefits.
>>>>>>>>>>
>>>>>>>>>>> 2. When a device private folio is mapped on the device, can someone
>>>>>>>>>>>    other than the device driver manipulate it, assuming core-mm just
>>>>>>>>>>>    skips device private folios (barring the CPU access fault
>>>>>>>>>>>    handling)?
>>>>>>>>>>>
>>>>>>>>>>> Where I am going is: can device private folios be treated as
>>>>>>>>>>> unmapped folios by the CPU, with only the device driver manipulating
>>>>>>>>>>> their mappings?
>>>>>>>>>>>
>>>>>>>>>> Yes, not present for the CPU, but mm has bookkeeping on them. The
>>>>>>>>>> private page has no content someone could change while in the device,
>>>>>>>>>> it's just a pfn.
>>>>>>>>> Just to clarify: a device-private entry, like a device-exclusive
>>>>>>>>> entry, is a *page table mapping* tracked through the rmap -- even
>>>>>>>>> though they are not present page table entries.
>>>>>>>>>
>>>>>>>>> It would be better if they were present page table entries that are
>>>>>>>>> PROT_NONE, but it's tricky to mark them as being "special"
>>>>>>>>> device-private, device-exclusive etc. Maybe there are ways to do that
>>>>>>>>> in the future.
>>>>>>>>>
>>>>>>>>> Maybe device-private could just be PROT_NONE, because we can identify
>>>>>>>>> the entry type based on the folio. device-exclusive is harder ...
>>>>>>>>>
>>>>>>>>> So consider device-private entries just like PROT_NONE present page
>>>>>>>>> table entries. Refcount and mapcount are adjusted accordingly by rmap
>>>>>>>>> functions.
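[The distinction being drawn here, as a sketch using the existing helpers
from <linux/swapops.h>; the branch bodies are illustrative only:]

        swp_entry_t entry = pte_to_swp_entry(pte);

        if (is_device_private_entry(entry)) {
                /*
                 * Non-present, but still an rmap-tracked mapping: it holds a
                 * folio reference and is counted in the folio's mapcount.
                 */
                struct folio *folio = pfn_swap_entry_folio(entry);
                /* ... treat like a PROT_NONE-style mapping of folio ... */
        } else if (is_migration_entry(entry)) {
                /* Non-present and NOT part of mapcount/refcount bookkeeping. */
        }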
>>>>>>>> Thanks for the clarification.
>>>>>>>>
>>>>>>>> So folio_mapcount() for device private folios should be treated the
>>>>>>>> same as for normal folios, even if the corresponding PTEs are not
>>>>>>>> accessible from CPUs. Then I wonder if the device private large folio
>>>>>>>> split should go through __folio_split(), the same as normal folios:
>>>>>>>> unmap, freeze, split, unfreeze, remap. Otherwise, how can we prevent
>>>>>>>> rmap changes during the split?
>>>>>>>>
>>>>>>> That is true in general; the special cases I mentioned are:
>>>>>>>
>>>>>>> 1. Split during migration (where the sizes on source/destination do
>>>>>>>    not match), so we need to split in the middle of migration. The
>>>>>>>    entries there are already unmapped, hence the special handling.
>>>>>>> 2. The partial unmap case, where we need to split in the context of
>>>>>>>    the unmap due to the issues mentioned in the patch. The folio split
>>>>>>>    code for device private pages is expanded into its own helper,
>>>>>>>    which does not need to do the xas/mapped/lru folio handling. During
>>>>>>>    partial unmap, the original folio does get replaced by new anon
>>>>>>>    rmap ptes (split_huge_pmd_locked).
>>>>>>>
>>>>>>> For (2), I spent some time examining the implications of not unmapping
>>>>>>> the folios prior to the split; in the partial unmap path, once we
>>>>>>> split the PMD, the folios diverge. I did not run into any particular
>>>>>>> race either with the tests.
>>>>>> 1) is totally fine. This was in v1 and led to Zi's
>>>>>> split_unmapped_folio().
>>>>>>
>>>>>> 2) is a problem because the folio is mapped. split_huge_pmd() can be
>>>>>> reached from paths other than unmap as well. It is vulnerable to races
>>>>>> by rmap. And for instance this does not look right without checking:
>>>>>>
>>>>>>     folio_ref_freeze(folio, 1 + folio_expected_ref_count(folio));
>>>>>>
>>>>> I can add checks to make sure that the call does succeed.
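[For reference, __folio_split() guards its freeze with can_split_folio();
a device private helper could presumably do the same. A sketch, assuming
the current can_split_folio() signature; the error handling is
illustrative:]

        int extra_pins;

        /* Sketch: validate references before freezing, as __folio_split() does. */
        if (!can_split_folio(folio, 1, &extra_pins))
                return -EAGAIN;
        if (!folio_ref_freeze(folio, 1 + extra_pins))
                return -EAGAIN;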
>>>>>> You mention 2) is needed because of some later problems in the fault
>>>>>> path after the pmd split. Would it be possible to split the folio at
>>>>>> fault time then?
>>>>> So after the partial unmap, the folio ends up in a slightly strange
>>>>> situation: the folio is large, but not mapped (since large_mapcount can
>>>>> be 0 after all the folio_remove_rmap_ptes() calls). Calling
>>>>> folio_split() on a partially unmapped folio fails because
>>>>> folio_get_anon_vma() fails, due to the folio_mapped() failures related
>>>>> to folio_large_mapcount. There is also additional complexity with ref
>>>>> counts and mapping.
>>>> I think you mean "Calling folio_split() on a *fully* unmapped folio
>>>> fails ..."
>>>>
>>>> A partially mapped folio still has folio_mapcount() > 0 ->
>>>> folio_mapped() == true.
>>>>
>>> Looking into this again at my end.
>>>
>>>>>> Also, didn't quite follow what kind of lock recursion you encountered
>>>>>> when doing a proper split_folio() instead?
>>>>>>
>>>>> Splitting during partial unmap causes recursive locking issues with
>>>>> anon_vma when invoked from the split_huge_pmd_locked() path.
>>>> Yes, that's very complicated.
>>>>
>>> Yes, and I want to avoid going down that path.
>>>
>>>>> Deferred splits do not work for device private pages, due to the
>>>>> migration requirements for fault handling.
>>>> Can you elaborate on that?
>>>>
>>> If a folio is under deferred_split() and is still pending a split, and a
>>> fault is handled on that partially mapped folio, the expectation is
>>> that, as part of fault handling during migration, the code in
>>> folio_migrate_mapping() assumes that the folio sizes are the same (via
>>> the check for reference and mapcount).
>> If you hit a partially-mapped folio, instead of migrating, you would
>> actually want to split and then migrate, I assume.
> Yes, that is exactly what migrate_pages() does. And if the split fails,
> the migration fails too. Device private folios probably should do the
> same thing, assuming splitting a device private folio would always
> succeed.

Hmm, afaics the normal folio_split() wants to use RMP_USE_SHARED_ZEROPAGE
when remapping after the split; that can't work for device private
pages..

>
> Best Regards,
> Yan, Zi
>

--Mika
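[For context on that last point, the remap step in question, sketched:
remap_page() and RMP_USE_SHARED_ZEROPAGE are internal to
mm/huge_memory.c, and the folio_is_device_private() special-casing shown
here is hypothetical, not existing code:]

        /*
         * Sketch: __folio_split() ends by remapping with
         * RMP_USE_SHARED_ZEROPAGE, letting zero-filled subpages be replaced
         * by the shared zeropage. Device private folio contents live in
         * device memory, so that substitution cannot apply to them.
         */
        remap_page(folio, 1 << order,
                   folio_is_device_private(folio) ? 0 : RMP_USE_SHARED_ZEROPAGE);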