From: David Hildenbrand
Organization: Red Hat
Date: Tue, 22 Feb 2022 09:56:20 +0100
Message-ID: <7b908208-02f8-6fde-4dfc-13d5e00310a6@redhat.com>
To: lsf-pc@lists.linux-foundation.org, "linux-mm@kvack.org"
Subject: [LSF/MM/BPF TOPIC] page table reclaim

Hi all,

we are aware of workloads that can trigger the allocation of a lot of
page tables that are essentially unnecessary. The obvious candidates are
processes that dynamically manage memory consumption in large, sparse
memory mappings, e.g., via madvise(MADV_DONTNEED): hypervisors that
implement memory ballooning or virtio-mem, and memory allocators. In
fact, it's easy to have a process that consumes almost nothing but page
tables, and it's hard to distinguish between "malicious" and "sane"
workloads when looking only at the page table consumption. I have quite
a few neat examples that I can present.

Page tables are unmovable in memory and cannot get swapped out. So heavy
page table consumption isn't only problematic because we end up wasting
system RAM and fragmenting it with unmovable allocations; it's also a
problem when big portions of system RAM are managed by CMA/ZONE_MOVABLE,
where we can simply run out of system RAM available for unmovable
allocations and eventually harm the system / other workloads on the same
machine.

One approach I'd like to discuss is page table reclaim: reclaiming
unnecessary page tables, which involves a lot of challenges.

1. Efficient page table reclaim

"Ripping out" a page table is an expensive and highly complicated
operation: just take a look at khugepaged. We have to block all page
table walkers, which requires the mmap_lock in write mode, the rmap
lock, and proper synchronization with GUP-fast.

In the simplest approach, we'd scan for candidate page tables to then
rip them out. But:
* How to scan for candidate page tables efficiently?
* How to avoid the mmap_lock in write mode when removing a page table?
* How to avoid the rmap lock (just imagine a page table spanning
  multiple rmaps)?

But also: how to make the implementation simple and appealing enough to
get merged upstream? For example, the last attempt to reclaim empty PTE
page tables [1] automatically once the last PTE was zapped has not been
merged yet because it certainly adds complexity. How to avoid that
complexity?

2. Who triggers reclaim and when?

Letting an application trigger reclaim of page tables is the "easiest
solution": let's imagine madvise(MADV_RECLAIM_PGTABLES). However, this
doesn't take care of malicious workloads and is more problematic when
sparse files are mapped into multiple processes. Further, there is no
need to reclaim if we're not under memory pressure.

Letting the system do this automatically looks "cleaner". But when to
start reclaiming? How to detect and handle malicious processes (do we
care?)? How to set an adequate soft/hard limit?

3. Which page tables to reclaim?

While the obvious candidates are empty page tables, we can easily have
page tables all filled with the shared zeropage instead. Once again,
there are sane and malicious use cases. A sane use case is a simple VM
having a balloon inflated and triggering a memory dump like kdump: we'll
populate the shared zeropage everywhere and have plenty of page tables
we don't even care about.

But once we talk about reclaiming page tables that are still populated
with the shared zeropage, why not reclaim page tables that are
"reconstructable", for example, because they don't map anonymous pages
and don't require special fault handling (userfaultfd?)?

While I do have answers to some of the questions and various ideas, it's
certainly an interesting topic to discuss and brainstorm.

[1] https://lkml.kernel.org/r/20211110105428.32458-1-zhengqi.arch@bytedance.com

-- 
Thanks,

David / dhildenb