Date: Thu, 13 Jun 2024 17:32:30 +0800
Subject: Re: [RFC PATCH 0/3] asynchronously scan and free empty user PTE pages
To: David Hildenbrand
Cc: hughd@google.com, willy@infradead.org, mgorman@suse.de, muchun.song@linux.dev, akpm@linux-foundation.org, linux-mm@kvack.org, linux-kernel@vger.kernel.org
References: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com>
From: Qi Zheng
In-Reply-To: <02f8cbd0-8b2b-4c2d-ad96-f854d25bf3c2@redhat.com>

Hi David,

Thanks for such a quick reply!

On 2024/6/13 17:04, David Hildenbrand wrote:
> On 13.06.24 10:38, Qi Zheng wrote:
>> Hi all,

[...]

>
>
>> 3. Implementation
>> =================
>>
>> For empty user PTE pages, we don't actually need to free it immediately,
>> nor do we need to free all of it.
>>
>> Therefore, in this patchset, we register a task_work for the user tasks to
>> asynchronously scan and free empty PTE pages when they return to user space.
>> (The scanning time interval and address space size can be adjusted.)
>
> The question is, if we really have to scan asynchronously, or if it would
> be reasonable for most use cases to trigger a madvise(MADV_PT_RECLAIM)
> every now and then. For virtio-mem, and likely most memory allocators,
> that might be feasible, and valuable independent of system-wide
> automatic scanning.

Agreed, I also think it is possible to add always and madvise modes,
similar to THP.

>
>>
>> When scanning, we can filter out some unsuitable vmas:
>>
>>      - VM_HUGETLB vma
>>      - VM_UFFD_WP vma
>
> Why is UFFD_WP unsuitable? It should be suitable as long as you make
> sure to really only remove page tables that are all pte_none().

Got it. I mistakenly thought pte_none() covered the pte marker case until I
saw pte_none_mostly().

>
>>      - etc
>> And for some PTE pages that span multiple vmas, we can also skip.
>>
>> For locking:
>>
>>      - use the mmap read lock to traverse the vma tree and pgtable
>>      - use pmd lock for clearing pmd entry
>>      - use pte lock for checking empty PTE page, and release it after clearing
>>        pmd entry, then we can capture the changed pmd in pte_offset_map_lock()
>>        etc after holding this pte lock. Thanks to this, we don't need to hold the
>>        rmap-related locks.
>>      - users of pte_offset_map_lock() etc all expect the PTE page to be stable by
>>        using rcu lock, so use pte_free_defer() to free PTE pages.
>
> I once had a prototype that would scan similar to GUP-fast, using the
> mmap lock in read mode and disabling local IRQs and then walking the
> page table locklessly (no PTLs).
> Only when identifying an empty page and
> ripping out the page table, it would have to do more heavy locking (back
> when we required the mmap lock in write mode and other things).

Maybe the mmap write lock is not necessary; we can protect it using the pmd
lock and pte lock as above.

>
> I can try digging up that patch if you're interested.

Yes, that would be great! Maybe it can provide more inspiration.

>
> We'll have to double check whether all anon memory cases can *properly*
> handle pte_offset_map_lock() failing (not just handling it, but doing
> the right thing; most of that anon-only code didn't ever run into that
> issue so far, so these code paths were likely never triggered).

Yeah, I'll keep checking this out too.

>
>
>> For the path that will also free PTE pages in THP, we need to recheck whether the
>> content of pmd entry is valid after holding pmd lock or pte lock.
>>
>> 4. TODO
>> =======
>>
>> Some applications may be concerned about the overhead of scanning and rebuilding
>> page tables, so the following features are considered for implementation in the
>> future:
>>
>>      - add per-process switch (via prctl)
>>      - add a madvise option (like THP)
>>      - add MM_PGTABLE_SCAN_DELAY/MM_PGTABLE_SCAN_SIZE control (via procfs file)
>> Perhaps we can add the refcount to PTE pages in the future as well, which would
>> help improve the scanning speed.
>
> I didn't like the added complexity last time, and the problem of
> handling situations where we squeeze multiple page tables into a single
> "struct page".

OK. Apart from the refcount, do you think the other three TODOs above are
still worth doing?

Thanks,
Qi

>
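
To illustrate the pte_none() vs. pte_none_mostly() point from the UFFD_WP
discussion, here is a minimal userspace sketch. It is a toy model, not kernel
code and not the patchset's implementation: toy_pte_t, the TOY_* names and
both page-level helpers are hypothetical stand-ins. The only things taken from
the kernel are the idea that pte_none_mostly() also accepts pte markers, and
the suggestion above to reclaim only page tables whose entries are all
pte_none().

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy model only: hypothetical stand-ins, not the kernel's pte_t or helpers. */
#define TOY_PTRS_PER_PTE 512 /* entries per PTE page, as on x86-64 with 4K pages */

enum toy_pte_kind {
	TOY_PTE_NONE = 0,	/* completely empty entry */
	TOY_PTE_MARKER,		/* non-present entry carrying a marker (e.g. uffd-wp) */
	TOY_PTE_PRESENT,	/* entry that maps a page */
};

typedef struct {
	enum toy_pte_kind kind;
} toy_pte_t;

/* Models pte_none(): true only for a truly empty entry. */
bool toy_pte_none(toy_pte_t pte)
{
	return pte.kind == TOY_PTE_NONE;
}

/* Models pte_none_mostly(): empty, or non-present but carrying a marker. */
bool toy_pte_none_mostly(toy_pte_t pte)
{
	return toy_pte_none(pte) || pte.kind == TOY_PTE_MARKER;
}

/*
 * Strict check: the PTE page may be freed only if every entry is
 * toy_pte_none(). A single marker keeps the page alive, so the marker
 * (and the uffd-wp state it records) is not lost.
 */
bool toy_pte_page_reclaimable(const toy_pte_t *page)
{
	assert(page != NULL);
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++) {
		if (!toy_pte_none(page[i]))
			return false;
	}
	return true;
}

/*
 * Looser check, shown only for contrast: a page holding nothing but
 * markers passes this test although it must not be reclaimed.
 */
bool toy_pte_page_mostly_empty(const toy_pte_t *page)
{
	assert(page != NULL);
	for (size_t i = 0; i < TOY_PTRS_PER_PTE; i++) {
		if (!toy_pte_none_mostly(page[i]))
			return false;
	}
	return true;
}
```

With a zero-initialized array of TOY_PTRS_PER_PTE entries, both checks pass.
After setting one entry to TOY_PTE_MARKER, only the looser check still passes,
which is why the strict check is the one a scanner would want when VM_UFFD_WP
vmas are no longer filtered out.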