From: David Hildenbrand
Organization: Red Hat
Date: Mon, 22 Nov 2021 10:19:09 +0100
Subject: Re: [PATCH] mm: split thp synchronously on MADV_DONTNEED
To: Matthew Wilcox, Shakeel Butt
Cc: "Kirill A . Shutemov", Yang Shi, Zi Yan, Andrew Morton,
    linux-mm@kvack.org, linux-kernel@vger.kernel.org
Message-ID: <861f98b5-9211-98c7-b4f7-fd71146aa64c@redhat.com>
References: <20211120201230.920082-1-shakeelb@google.com>

On 22.11.21 05:56, Matthew Wilcox wrote:
> On Sat, Nov 20, 2021 at 12:12:30PM -0800, Shakeel Butt wrote:
>> Many applications do sophisticated management of their heap memory for
>> better performance but with low cost. We have a bunch of such
>> applications running on our production and examples include caching and
>> data storage services. These applications keep their hot data on the
>> THPs for better performance and release the cold data through
>> MADV_DONTNEED to keep the memory cost low.
>>
>> The kernel defers the split and release of THPs until there is memory
>> pressure. This causes complicates the memory management of these
>> sophisticated applications which then needs to look into low level
>> kernel handling of THPs to better gauge their headroom for expansion. In
>> addition these applications are very latency sensitive and would prefer
>> to not face memory reclaim due to non-deterministic nature of reclaim.
>>
>> This patch let such applications not worry about the low level handling
>> of THPs in the kernel and splits the THPs synchronously on
>> MADV_DONTNEED.
>
> I've been wondering about whether this is really the right strategy
> (and this goes wider than just this one, new case)
>
> We chose to use a 2MB page here, based on whatever heuristics are
> currently in play. Now userspace is telling us we were wrong and should
> have used smaller pages.

IIUC, not necessarily, unfortunately.

User space might be discarding the whole 2MB via a single call
(MADV_DONTNEED on a 2MB range, as done by virtio-balloon with "free page
reporting" or by virtio-mem in QEMU). In that case, there is nothing to
migrate and we were not doing anything wrong.

In the more extreme case, user space might be discarding the whole THP
in small pieces over a short period of time. This happens, for example,
when a VM inflates the memory balloon via virtio-balloon. All inflation
requests are 4k, resulting in 4k MADV_DONTNEED calls. If we end up
inflating a THP range inside the VM that maps to a THP range inside the
hypervisor, we'll essentially free a THP in the hypervisor piece by
piece using individual MADV_DONTNEED calls -- this happens frequently.

Something similar can happen when de-fragmentation inside the VM "moves
around" inflated 4k pages piece by piece to essentially form a huge
inflated range -- this happens less frequently as of now. In both cases,
migration is counter-productive, as we're just about to free the whole
range either way.
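
To make the two discard patterns concrete, here is a minimal user-space
sketch. It is not QEMU code: all function and macro names are made up
for illustration, the 2MB THP size is an assumption (x86-64), and
MADV_HUGEPAGE is only a hint, so whether a THP actually backs the range
depends on the system configuration. The kernel sees either one
MADV_DONTNEED for the whole range or a stream of base-page-sized calls
that, taken individually, look like partial discards.

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

/* Assumption: 2MB THP size, as on x86-64. */
#define HUGE_RANGE_SIZE (2UL * 1024 * 1024)

/*
 * Hypothetical helper: map anonymous memory and return a 2MB-aligned,
 * 2MB-sized window inside it. *raw and *raw_len describe the full
 * mapping so the caller can munmap() it later.
 */
unsigned char *map_aligned_2mb(void **raw, size_t *raw_len)
{
    *raw_len = 2 * HUGE_RANGE_SIZE;
    *raw = mmap(NULL, *raw_len, PROT_READ | PROT_WRITE,
                MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (*raw == MAP_FAILED)
        return NULL;

    uintptr_t aligned = ((uintptr_t)*raw + HUGE_RANGE_SIZE - 1) &
                        ~(uintptr_t)(HUGE_RANGE_SIZE - 1);
    unsigned char *p = (unsigned char *)aligned;

    /* Hint only; the result is deliberately ignored. */
    (void)madvise(p, HUGE_RANGE_SIZE, MADV_HUGEPAGE);
    return p;
}

/* Write to every byte so the whole range is populated. */
void touch_range(unsigned char *p)
{
    memset(p, 0xAA, HUGE_RANGE_SIZE);
}

/* Pattern 1: discard the whole 2MB range with a single call. */
int discard_at_once(unsigned char *p)
{
    return madvise(p, HUGE_RANGE_SIZE, MADV_DONTNEED);
}

/*
 * Pattern 2: discard the same range one base page at a time, similar
 * in spirit to a balloon inflating in 4k steps. Returns the number of
 * madvise() calls issued, or -1 on the first failure.
 */
long discard_piecewise(unsigned char *p)
{
    long page = sysconf(_SC_PAGESIZE);
    long calls = 0;

    if (page <= 0)
        return -1;
    for (size_t off = 0; off < HUGE_RANGE_SIZE; off += (size_t)page) {
        if (madvise(p + off, (size_t)page, MADV_DONTNEED) != 0)
            return -1;
        calls++;
    }
    return calls;
}
```

With 4k base pages, the piecewise variant issues 512 calls for one 2MB
range. After either variant the private anonymous range reads back as
zeroes, so from user space the end result is the same; only the sequence
of requests the kernel observes differs.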

(yes, there are ways to optimize, for example using hugepage ballooning
or merging MADV_DONTNEED calls in the hypervisor, but what I described
is what we currently implement in hypervisors like QEMU, because there
are corner cases for everything)

Long story short: it's hard to tell what will happen next based on a
single MADV_DONTNEED call. Page compaction, in comparison, doesn't care
and optimizes the layout as it observes it.

--
Thanks,

David / dhildenb