Message-ID: <32a2432f-7f18-db5e-87a7-d8ba7c543076@bytedance.com>
Date: Wed, 10 Nov 2021 21:54:22 +0800
Subject: Re: [PATCH v3 00/15] Free user PTE page table pages
To: Jason Gunthorpe
Cc: akpm@linux-foundation.org, tglx@linutronix.de,
 kirill.shutemov@linux.intel.com, mika.penttila@nextfour.com,
 david@redhat.com, linux-doc@vger.kernel.org, linux-kernel@vger.kernel.org,
 linux-mm@kvack.org, songmuchun@bytedance.com, zhouchengming@bytedance.com
References: <20211110105428.32458-1-zhengqi.arch@bytedance.com>
 <20211110125601.GQ1740502@nvidia.com>
From: Qi Zheng
In-Reply-To: <20211110125601.GQ1740502@nvidia.com>

On 11/10/21 8:56 PM, Jason Gunthorpe wrote:
> On Wed, Nov 10, 2021 at 06:54:13PM +0800, Qi Zheng wrote:
>
>> In this patch series, we add a pte_refcount field to the struct page of page
>> table to track how many users of PTE page table. Similar to the mechanism of
>> page refcount, the user of PTE page table should hold a refcount to it before
>> accessing. The PTE page table page will be freed when the last refcount is
>> dropped.
>
> So, this approach basically adds two atomics on every PTE map
>
> If I have it right the reason that zap cannot clean the PTEs today is
> because zap cannot obtain the mmap lock due to a lock ordering issue
> with the inode lock vs mmap lock.

Currently, both MADV_DONTNEED and MADV_FREE take the read side of
mmap_lock rather than the write side, which is why jemalloc/tcmalloc
prefer to use madvise() to release physical memory.
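
For readers less familiar with this usage pattern, here is a minimal
user-space sketch of what an allocator does when it hands memory back with
madvise(). It is only an illustration and not part of the patch series;
release_with_madvise() is a hypothetical helper name, and the zero-fill
check relies on Linux behaviour for private anonymous mappings.

```c
#define _DEFAULT_SOURCE
#include <stddef.h>
#include <string.h>
#include <sys/mman.h>

/*
 * Hypothetical helper (not from the patch series): map `len` bytes of
 * anonymous memory, dirty it, then release the backing pages with
 * madvise(MADV_DONTNEED). Returns 0 if the range reads back as zeroes
 * afterwards, -1 on any failure.
 */
int release_with_madvise(size_t len)
{
    unsigned char *buf = mmap(NULL, len, PROT_READ | PROT_WRITE,
                              MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (buf == MAP_FAILED)
        return -1;

    memset(buf, 0xab, len);   /* fault in and dirty every page */

    /* Drop the physical pages; the virtual mapping itself stays valid. */
    if (madvise(buf, len, MADV_DONTNEED) != 0) {
        munmap(buf, len);
        return -1;
    }

    /* On Linux, private anonymous pages are zero-filled on the next access. */
    int ok = 1;
    for (size_t i = 0; i < len; i++) {
        if (buf[i] != 0) {
            ok = 0;
            break;
        }
    }

    munmap(buf, len);
    return ok ? 0 : -1;
}
```

Calling release_with_madvise(2 * 1024 * 1024) exercises a 2 MiB range. The
data pages go away, but nothing in this path asks the kernel to free the
page table pages that mapped them, which is the gap this series targets.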
>
> If it could obtain the mmap lock then it could do the zap using the
> write side as unmapping a vma does.

Even if it obtains the write side of mmap_lock, how do we make sure that
all the page table entries are empty? Traverse all 512 entries every time?

>
> Rather than adding a new "lock" to every PTE I wonder if it would be
> more efficient to break up the mmap lock and introduce a specific
> rwsem for the page table itself, in addition to the PTL. Currently the
> mmap lock is protecting both the vma list and the page table.

Each level of page table already has its own spinlock. Could you explain
in more detail how this dedicated rwsem would work?

Narrowing the scope of mmap_lock would indeed be a great thing, but I
think it is very difficult, and it would still not solve the problem of
how to check that all entries in a page table page are empty.

>
> I think that would allow the lock ordering issue to be resolved and
> zap could obtain a page table rwsem.
>
> Compared to two atomics per PTE this would just be two atomics per
> page table walk operation, it is conceptually a lot simpler, and would
> allow freeing all the page table levels, not just PTEs.

Only PTE pages are released for now because they account for the largest
share of page table memory. The same reference count can also be used for
the other levels of page tables.

>
> ?
>
> Jason
>
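
To make the trade-off in this thread concrete, below is a toy user-space
model of the two ways to decide that a PTE page can be freed: dropping the
last reference versus scanning all 512 entries. Everything here is an
assumption made for illustration. The toy_* names are hypothetical, the
structure is not a kernel structure, and the code does not reflect the
actual API or locking of the patch series.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

/* Entries per PTE page on x86-64 with 4 KiB pages and 8-byte entries. */
#define TOY_PTRS_PER_PTE 512

/* Hypothetical stand-in for a PTE page table page; not a kernel structure. */
struct toy_pte_table {
    uint64_t entries[TOY_PTRS_PER_PTE];
    atomic_int refcount;   /* plays the role of pte_refcount */
};

struct toy_pte_table *toy_table_alloc(void)
{
    struct toy_pte_table *t = calloc(1, sizeof(*t));
    if (t)
        atomic_init(&t->refcount, 1);   /* reference held by the creator */
    return t;
}

/* Take a reference before touching the table. */
void toy_table_get(struct toy_pte_table *t)
{
    atomic_fetch_add(&t->refcount, 1);
}

/* Drop a reference; frees the table and returns true on the last put. */
bool toy_table_put(struct toy_pte_table *t)
{
    int old = atomic_fetch_sub(&t->refcount, 1);
    assert(old > 0);   /* a put without a matching get is a caller bug */
    if (old == 1) {
        free(t);
        return true;
    }
    return false;
}

/* The alternative: scan every entry to decide whether the table is empty. */
bool toy_table_is_empty_by_scan(const struct toy_pte_table *t)
{
    for (int i = 0; i < TOY_PTRS_PER_PTE; i++) {
        if (t->entries[i] != 0)
            return false;
    }
    return true;
}
```

The get/put pair costs two atomic operations per access, which is the
overhead Jason points out; the scan costs no atomics but has to visit up
to 512 entries each time the question is asked.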