From mboxrd@z Thu Jan 1 00:00:00 1970
Subject: Re: [RFC PATCH 0/8] mm/madvise: support process_madvise(MADV_DONTNEED)
To: Nadav Amit
Cc: Andrew Morton, Linux-MM, Linux Kernel Mailing List, Peter Xu, Andrea Arcangeli, Minchan Kim, Colin Cross, Suren Baghdasarya, Mike Rapoport
References: <20210926161259.238054-1-namit@vmware.com> <7ce823c8-cfbf-cc59-9fc7-9aa3a79740c3@redhat.com> <6E8A03DD-175F-4A21-BCD7-383D61344521@gmail.com> <2753a311-4d5f-8bc5-ce6f-10063e3c6167@redhat.com> <9DE833C8-515F-4427-9867-E5BF9AD380FB@gmail.com> <9b53a85c-83f4-4548-c3b5-c65bd8737670@redhat.com>
From: David Hildenbrand
Organization: Red Hat
Message-ID: <5a7da918-9be4-1e92-187c-f7b6e27c4dcd@redhat.com>
Date: Thu, 7 Oct 2021 18:46:51 +0200
Sender: owner-linux-mm@kvack.org

On 07.10.21 18:19, Nadav Amit wrote:
>
>
>> On Oct 4, 2021, at 10:58 AM, David Hildenbrand wrote:
>>
>>>>
>>>> Thanks for the pointer.
>>>>
>>>> And my question would be if something like DAMON would actually be what you want.
>>> I looked into DAMON and even with the proposed future extensions it sounds
>>> like a different approach with certain benefits but with many limitations.
>>> The major limitation of DAMON is that you need to predefine the logic you
>>> want for reclamation into the kernel. You can add programmability through
>>> some API or even eBPF, but it would never be as easy or as versatile as
>>> what a userspace manager can achieve. We already have pretty much all the
>>> facilities to do so from userspace, and the missing parts (at least for
>>> a basic userspace manager) are almost already there. In contrast, see how
>>> many iterations are needed for the basic DAMON implementation.
>>
>> I can see what you're saying when looking at optimizing a handful of special applications. I still fail to see how something like that could work as a full replacement for in-kernel swapping. I'm happy to learn.
>
> I am not arguing it is a full replacement, at least at this stage.
>
>>
>>> The second, also big, difference is that DAMON looks only at reclamation.
>>> If you want a custom prefetch scheme or a different I/O stack for backing
>>> storage, you cannot have one.
>>
>> I do wonder if it could be extended for prefetching. But I am absolutely not a DAMON expert.
>>
>> [...]
>
> These are 2 different approaches. One is to provide some logic
> for the kernel (DAMON). The other is to provide userspace full
> control over paging operations (with caveats). Obviously, due to
> the caveats, the kernel paging mechanism behaves as a backup.
>
>>
>>>>
>>>> You raise a very excellent point (and it should have been part of your initial sales pitch): how does it differ from process_vm_writev().
>>>>
>>>> I can say that it differs in a way that you can break applications in more extreme ways. Let me give you two examples:
>>>>
>>>> 1. longterm pinnings: you raised this yourself; this can break an application silently and there is barely a safe way your tooling could handle it.
>>>>
>>>> 2. pagemap: applications can depend on the populated (present | swap) information in the pagemap for correctness. For example, there was recently a discussion to use pagemap information to speed up live migration of VMs, by skipping migration of !populated pages. There is currently no way your tooling can fake that. In comparison, ordinary swapping in the kernel can handle it.
>>> I understand (1). As for (2): the scenario that you mention sounds
>>> very specific, and one can argue that ignoring UFFD-registered
>>> regions in such a case is either (1) wrong or (2) should trigger
>>> some UFFD event.
>>>>
>>>> Is it easy to break an application with process_vm_writev()? Yes. When talking about dynamic debugging, it's expected that you break the target already -- or the target is already broken. Is it easier to break an application with process_madvise(MADV_DONTNEED)? I'd say yes, especially when implementing something way beyond debugging as you describe.
>>> If you do not know what you are doing, you can easily break anything.
>>> Note that there are other APIs that can break your application even
>>> worse, specifically ptrace().
>>>> I'm giving you "a hard time" for the reason Michal raised: we discussed this in the past already at least two times IIRC and "it is a free ticket to all sorts of hard to debug problem" in our opinion; especially when we mess around in other process address spaces besides for debugging.
>>>>
>>>> I'm not the person to ack/nack this, I'm just asking the questions :)
>>> I see your points and I try to look for a path of least resistance.
>>> I thought that process_madvise() is a nice interface to hook into.
>>
>> It would be the right interface -- iff the operation wouldn't have a bad smell to it.
>> We don't really want applications to mess around in the page table layout of some other process: however, that is exactly what you require. By unlocking that interface for that use case we agree that what you are proposing is a "sane use case", but ...
>>
>>> But if you are concerned it will be misused, how about adding instead
>>> an IOCTL that will zap pages but only in UFFD-registered regions?
>>> A separate IOCTL for this matter has the advantage of being more
>>> tailored for UFFD, not notifying UFFD upon “remove”, and being less
>>> likely to be misused.
>>
>> ... that won't change the fact that with your user-space swapping approach that requires this interface we can break some applications silently, and that's really the major concern I have.
>>
>> I mean, there are more cases where you can just harm the target application I think, for example if the target application uses SOFTDIRTY tracking.
>>
>>
>> To judge if this is a sane use case we want to support, it would help a lot if there were actual code+evaluation when actually implementing some of these advanced policies. Because you raise a lot of interesting points in your reply to Michal to back your use case, and naive me thinks "this sounds interesting but ... aren't we losing a lot of flexibility+features when doing this in user space? Does anyone actually want to do it like that?".
>>
>> Again, I'm not the person to ack/nack this, I'm just questioning if the use case that requires this interface is actually something that will get used later in real life because it has real advantages, or if it's a pure research project that will get abandoned at some point and we end up having exposed an interface we really didn't want to expose so far (especially, because all other requests so far were bogus).
>
> I do want to release the code, but it is really
> incomplete/immature at this point.
> I would note that there are additional
> use cases, such as workloads that have a discardable cache (or memoization
> data), which want a central/another entity to discard the data when
> there is memory pressure. (You can think about it as a userspace
> shrinker).
>
> Anyhow, as a path of least resistance, I think I would do the
> following:
>
> 1. Wait for the other madvise related patches to be applied.
> 2. Simplify the patches, specifically removing the data structure
>    changes based on Kirill's feedback.
> 3. Defer the enablement of the MADV_DONTNEED until I can show
>    code/performance numbers.

Sounds excellent. For your project to make progress at this stage I
assume this stuff doesn't have to be upstream, but it's good to discuss
upstream-ability.

Happy to learn more once you have more details to share.

-- 
Thanks,

David / dhildenb
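
For reference, the "populated (present | swap)" check from point 2 of the quoted discussion can be sketched roughly as below. This is only an illustration, not code from the patch series: the helper names (decode_pagemap_entry, read_pagemap_entry, PageState) are made up here, and the bit positions are taken from the kernel's pagemap documentation (bit 63 present, bit 62 swapped, bit 55 soft-dirty). The assumption being illustrated is that a page zapped from outside with MADV_DONTNEED would show up as neither present nor swapped, so a consumer like this would treat it as unpopulated even if a userspace manager still holds its contents.

```python
import struct
from typing import NamedTuple

# Bit positions as described in Documentation/admin-guide/mm/pagemap.rst.
PM_SOFT_DIRTY = 1 << 55  # PTE is soft-dirty
PM_SWAP = 1 << 62        # page is swapped out
PM_PRESENT = 1 << 63     # page is present in RAM


class PageState(NamedTuple):
    """Hypothetical container for the flags discussed in the thread."""
    present: bool
    swapped: bool
    soft_dirty: bool

    @property
    def populated(self) -> bool:
        # "populated" in the sense used above: present | swap
        return self.present or self.swapped


def decode_pagemap_entry(entry: int) -> PageState:
    """Decode one 64-bit pagemap entry into the flags of interest."""
    return PageState(
        present=bool(entry & PM_PRESENT),
        swapped=bool(entry & PM_SWAP),
        soft_dirty=bool(entry & PM_SOFT_DIRTY),
    )


def read_pagemap_entry(pid: int, vaddr: int, page_size: int = 4096) -> int:
    """Read the raw pagemap entry covering vaddr in process pid.

    Assumes Linux with /proc mounted and sufficient permissions;
    page_size defaults to 4096 and should match the target system.
    """
    with open(f"/proc/{pid}/pagemap", "rb") as f:
        f.seek((vaddr // page_size) * 8)  # one 8-byte entry per virtual page
        data = f.read(8)
    if len(data) != 8:
        raise OSError("short read from pagemap")
    return struct.unpack("=Q", data)[0]  # native-endian unsigned 64-bit
```

A tool that skips !populated pages would call decode_pagemap_entry(read_pagemap_entry(pid, addr)).populated per page; the same entry also carries the soft-dirty bit mentioned for SOFTDIRTY tracking.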