From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: X-Spam-Checker-Version: SpamAssassin 3.4.0 (2014-02-07) on aws-us-west-2-korg-lkml-1.web.codeaurora.org Received: from kanga.kvack.org (kanga.kvack.org [205.233.56.17]) by smtp.lore.kernel.org (Postfix) with ESMTP id 086A1C433F5 for ; Mon, 10 Oct 2022 14:47:12 +0000 (UTC) Received: by kanga.kvack.org (Postfix) id 460898E0001; Mon, 10 Oct 2022 10:47:12 -0400 (EDT) Received: by kanga.kvack.org (Postfix, from userid 40) id 40F956B0073; Mon, 10 Oct 2022 10:47:12 -0400 (EDT) X-Delivered-To: int-list-linux-mm@kvack.org Received: by kanga.kvack.org (Postfix, from userid 63042) id 2B0748E0001; Mon, 10 Oct 2022 10:47:12 -0400 (EDT) X-Delivered-To: linux-mm@kvack.org Received: from relay.hostedemail.com (smtprelay0011.hostedemail.com [216.40.44.11]) by kanga.kvack.org (Postfix) with ESMTP id 126016B0071 for ; Mon, 10 Oct 2022 10:47:12 -0400 (EDT) Received: from smtpin12.hostedemail.com (a10.router.float.18 [10.200.18.1]) by unirelay09.hostedemail.com (Postfix) with ESMTP id CAD5A80692 for ; Mon, 10 Oct 2022 14:47:11 +0000 (UTC) X-FDA: 80005317462.12.DD0ECC7 Received: from mail-pf1-f171.google.com (mail-pf1-f171.google.com [209.85.210.171]) by imf22.hostedemail.com (Postfix) with ESMTP id DD977C0026 for ; Mon, 10 Oct 2022 14:47:09 +0000 (UTC) Received: by mail-pf1-f171.google.com with SMTP id i3so10943172pfk.9 for ; Mon, 10 Oct 2022 07:47:09 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=RmLmBuwnxY8079sNUZWy0seIJQJaeaJcXLUydQ13od4=; b=lMC87aE7ClxvxqaDEhP79ZEp0tBmRmu+BkcZr8zS6xu7nIHlEsB9ZJ3vb/IuWuD+pf 28WviVdssVyD/ndnYFGbKdG5ahbgil7XM/o0TJypCqzm4AUwRAypqrw7xakYSEslJrq3 Y9Ms+9vY1Tuxeq7xs7wfdlQtWVg55I2MF1EvZsyBpgY4lCGtx54E51q9gXSyCwR6es5+ w5QsDzPM2nVquMXELBOKVv0oD72W1+MXizrXbs/d9rfWd7Ra2sKm6wHSvhBBXDrs+Bgo Bj/zEHwd2Y21i4EeSOKQ2X/EwkdYdpLTcp1k5xzfD1etNR35nS9nfZqGMcuox5Onolnd FrFw== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=to:references:message-id:content-transfer-encoding:cc:date :in-reply-to:from:subject:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=RmLmBuwnxY8079sNUZWy0seIJQJaeaJcXLUydQ13od4=; b=qjrdxvUgRirRgRX6MzZbZ6qhmWUzoqNmsYSKcTJGIJ8BrME0RZR04QckwjdfaWd+31 GpQbu8jaofsj4he9GLiqB9k8vsifWpeh0xxESKsPYa7Fl3fdLz8jgreZH8pwKOtajE+A KtyNRVvApWdp7SloPg0nNodMR5SI1Q1UT5B0elj0a6plehSq7wj//qhVuuz+lQQKv1I5 NN6NSkoo+5ZSje0umRGSCYIRrmoYo3TVV+MWC1QrLHAldCmYOWgzjLZBOpHkTRyVsKHK lQPOWTDdzw8joluGPyL4b7EKrTxfGjDyaDk7kcJstmVpwH+yzEZZA/M1ubaBxnPAKugG x9ug== X-Gm-Message-State: ACrzQf2afztO0M7o5/3M243CKvuCWz39OywOUg6f6Yn3Rmvi8tmHDzOM 1GfuFAAJTUV8D2MCUdcJ7J4= X-Google-Smtp-Source: AMsMyM6KkS7WCUWfdYXh1/HGf88nF6b5wKp19VPh6dCGXHtc8IvTX6/LKE4CtCGKv72UH4EPHy1pLQ== X-Received: by 2002:a63:6cca:0:b0:43c:7998:8a78 with SMTP id h193-20020a636cca000000b0043c79988a78mr16766082pgc.600.1665413228373; Mon, 10 Oct 2022 07:47:08 -0700 (PDT) Received: from smtpclient.apple (c-24-6-216-183.hsd1.ca.comcast.net. [24.6.216.183]) by smtp.gmail.com with ESMTPSA id i4-20020a63bf44000000b0044046aec036sm6247938pgo.81.2022.10.10.07.47.07 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 10 Oct 2022 07:47:07 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 16.0 \(3696.120.41.1.1\)) Subject: Re: RFC [PATCH v4 2/7] Enable balloon drivers to report inflated memory From: Nadav Amit In-Reply-To: Date: Mon, 10 Oct 2022 07:47:06 -0700 Cc: "Michael S. Tsirkin" , David Hildenbrand , Andrew Morton , kernel@openvz.org, Linux Virtualization , LKML , Linux MM Content-Transfer-Encoding: quoted-printable Message-Id: <1118F098-972A-4F58-8EE1-270A06E4F9D1@gmail.com> References: <20221005090158.2801592-1-alexander.atanasov@virtuozzo.com> <20221005090158.2801592-3-alexander.atanasov@virtuozzo.com> <88EDC41D-408F-4ADF-A933-0A6F36E5F262@gmail.com> <42C75E59-696B-41D5-BD77-68EFF0B075C6@gmail.com> <71E14334-CA3B-45FB-A854-7A8D6649C798@gmail.com> To: Alexander Atanasov X-Mailer: Apple Mail (2.3696.120.41.1.1) ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=hostedemail.com; s=arc-20220608; t=1665413229; h=from:from:sender:reply-to:subject:subject:date:date: message-id:message-id:to:to:cc:cc:mime-version:mime-version: content-type:content-type: content-transfer-encoding:content-transfer-encoding: in-reply-to:in-reply-to:references:references:dkim-signature; bh=RmLmBuwnxY8079sNUZWy0seIJQJaeaJcXLUydQ13od4=; b=qgz5syhoDlRcUlgAVBw2ZiTbnSHNTKuVeEhFLpAiCKNQ5ftxxY3PHdSA+mw0xTeEPmIvr2 FqBcCszkLK89zhwz0EiX8Hr4XI1Nh7T0HpB40wCogIzmaaP5xcZN033H72gmCgvCCl8y+F XZOcsbddxHv1o1W6jBMB5szn0hrSB8A= ARC-Authentication-Results: i=1; imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=lMC87aE7; spf=pass (imf22.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.210.171 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com; dmarc=pass (policy=none) header.from=gmail.com ARC-Seal: i=1; s=arc-20220608; d=hostedemail.com; t=1665413229; a=rsa-sha256; cv=none; b=cWLtf+MloDIwCj/DxnUfAHnZ0wx/Vd2Zl5XNR6RypC9eE11oh3aHrDwEqFOdlIED/98M88 ZVGR5VGw8nXPlte0BxAtGcy/oA00JCxw51rfL3QiAN1Mz2AqsRL3EDCmpsptj9uUmK1DAc xIngdD98jg5ynk7zBA1KHSgyaS+QSy8= Authentication-Results: imf22.hostedemail.com; dkim=pass header.d=gmail.com header.s=20210112 header.b=lMC87aE7; spf=pass (imf22.hostedemail.com: domain of nadav.amit@gmail.com designates 209.85.210.171 as permitted sender) smtp.mailfrom=nadav.amit@gmail.com; dmarc=pass (policy=none) header.from=gmail.com X-Rspam-User: X-Stat-Signature: xx18ase6ho548aenuqo51h9jfpomkram X-Rspamd-Queue-Id: DD977C0026 X-Rspamd-Server: rspam01 X-HE-Tag: 1665413229-13909 X-Bogosity: Ham, tests=bogofilter, spamicity=0.000000, version=1.2.4 Sender: owner-linux-mm@kvack.org Precedence: bulk X-Loop: owner-majordomo@kvack.org List-ID: On Oct 10, 2022, at 12:24 AM, Alexander Atanasov = wrote: > Hello, >=20 > On 10.10.22 9:18, Nadav Amit wrote: >> On Oct 7, 2022, at 3:58 AM, Alexander Atanasov = wrote: >>> So all balloon drivers give large amount of RAM on boot , then = inflate the balloon. But this places have already been initiallized and = they know that the system have given amount of totalram which is not = true the moment they start to operate. the result is that too much space = gets used and it degrades the userspace performance. >>> example - fs/eventpoll.c:static int __init eventpoll_init(void) - 4% = of ram for eventpool - when you inflate half of the ram it becomes 8% of = the ram - do you really need 8% of your ram to be used for eventpool? >>>=20 >>> To solve this you need to register and when notified update - cache = size, limits and for whatever is the calculated amount of memory used. >> Hmm.. Not sure about all of that. Most balloon drivers are manually = managed, >> and call adjust_managed_page_count(), and tas a result might want to = redo >> all the calculations that are based on totalram_pages(). >=20 > Yes, i will say that it looks like mixed manual - for large changes = and automatic for small changes. VMWare and HyperV have automatic and = manual/not sure exactly what you can change on a running VM but i guess = you can/ - Virtio is only manual. I do not know about dlpar / xen. >=20 > Scenario is like this start a VM with 4GB ram, reduce to 2GB with = balloon - vm can be upgraded. >=20 > All we are talking about relates to memory hotplug/unplug /where = unplug is close to nonexistant hence the balloons are used./ >=20 > All values should be recalculated on memory hotplug too, so you can = use the newly available RAM. >=20 > RAM is the most valuable resource of all so i consider using it = optimally to be of a great importance. >=20 >> Side-note: That=E2=80=99s not the case for VMware balloon. I actually = considered >> calling adjust_managed_page_count() just to conform with other = balloon >> drivers. But since we use totalram_pages() to communicate to the = hypervisor >> the total-ram, this would create endless (and wrong) feedback loop. I = am not >> claiming it is not possible to VMware balloon driver to call >> adjust_managed_page_count(), but the chances are that it would create = more >> harm than good. >=20 > Virtio does both - depending on the deflate on OOM option. I suggested = already to unify all drivers to inflate the used memory as it seems more = logical to me since no body expects the totalram_pages() to change but = the current state is that both ways are accepted and if changed can = break existing users. > See discussion here = https://lore.kernel.org/lkml/20220809095358.2203355-1-alexander.atanasov@v= irtuozzo.com/. Thanks for the reminder. I wish you can somehow summarize all of that = into the cover-letter and/or the commit messages for these patches. >=20 >=20 >> Back to the matter at hand. It seems that you wish that the notifiers = would >> be called following any changes that would be reflected in = totalram_pages(). >> So, doesn't it make more sense to call it from = adjust_managed_page_count() ? >=20 > It will hurt performance - all drivers work page by page , i.e. they = update by +1/-1 and they do so under locks which as you already noted = can lead to bad things. The notifier will accumulate the change and let = its user know how much changed, so the can decide if they have to = recalculate - it even can do so async in order to not disturb the = drivers. So updating the counters by 1 is ok (using atomic operation, which is = not free)? And the reason it is (relatively) cheap is because nobody = actually looks at the value (i.e., nobody actually acts on the value)? If nobody considers the value, then doesn=E2=80=99t it make sense just = to update it less frequently, and then call the notifiers? >>> The difference is here: >>>=20 >>> mm/zswap.c: return totalram_pages() * zswap_max_pool_percent / = 100 < >>> mm/zswap.c: return totalram_pages() * zswap_accept_thr_percent / = 100 >>> uses percents and you can recalculate easy with >>>=20 >>> +static inline unsigned long totalram_pages_current(void) >>> +{ >>> + unsigned long inflated =3D 0; >>> +#ifdef CONFIG_MEMORY_BALLOON >>> + extern atomic_long_t mem_balloon_inflated_free_kb; >>> + inflated =3D = atomic_long_read(&mem_balloon_inflated_free_kb); >>> + inflated >>=3D (PAGE_SHIFT - 10); >>> +#endif >>> + return (unsigned long)atomic_long_read(&_totalram_pages) - = inflated; >>> +} So we have here two values and it appears there is a hidden assumption = that they are both updated atomically. Otherwise, it appears, inflated theoretically might be greater that _totalram_pages dn we get negative = value and all hell breaks loose. But _totalram_pages and mem_balloon_inflated_free_kb are not updated atomically together (each one is, but not together). >>> And you are good when you switch to _current version - = si_meminfo_current is alike . >>>=20 >>> On init (probably) all use some kind of fractions to calculate but = when there is a set value via /proc/sys/net/ipv4/tcp_wmem for example it = is just a value and you can not recalculate it. And here, please, share = your ideas how to solve this. >> I don=E2=80=99t get all of that. Now that you provided some more = explanations, it >> sounds that what you want is adjust_managed_page_count(), which we = already >> have and affects the output of totalram_pages(). Therefore, = totalram_pages() >> anyhow accounts for the balloon memory (excluding VMware=E2=80=99s). = So why do we >> need to take mem_balloon_inflated_free_kb into account? > Ok, you have this: > / totalram > |----used----|b1|----free------|b2| >=20 > drivers can inflate both b1 and b2 - b1 free gets smaller, b2 totalram = pages get smaller. so when you need totalram_pages() to do a calculation = you need to adjust it with the pages that are inflated in free/used = (b1). VMWare is not exception , Virtio does the same. > And according to to mst and davidh it is okay like this. > So i am proposing a way to handle both cases. Ugh. What about BALLOON_INFLATE and BALLOON_DEFLATE vm-events? Can=E2=80=99= t this information be used instead of yet another counter? Unless, of course, = you get the atomicity that I mentioned before. >> Sounds to me that all you want is some notifier to be called from >> adjust_managed_page_count(). What am I missing? >=20 > Notifier will act as an accumulator to report size of change and it = will make things easier for the drivers and users wrt locking. > Notifier is similar to the memory hotplug notifier. Overall, I am not convinced that there is any value of separating the = value and the notifier. You can batch both or not batch both. In addition, as = I mentioned, having two values seems racy.