From mboxrd@z Thu Jan 1 00:00:00 1970 Return-Path: Received: from mail-pf0-f198.google.com (mail-pf0-f198.google.com [209.85.192.198]) by kanga.kvack.org (Postfix) with ESMTP id 7DE786B000A for ; Fri, 29 Jun 2018 11:55:48 -0400 (EDT) Received: by mail-pf0-f198.google.com with SMTP id t10-v6so1482887pfh.0 for ; Fri, 29 Jun 2018 08:55:48 -0700 (PDT) Received: from mga03.intel.com (mga03.intel.com. [134.134.136.65]) by mx.google.com with ESMTPS id a72-v6si8406297pge.497.2018.06.29.08.55.47 for (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Fri, 29 Jun 2018 08:55:47 -0700 (PDT) From: "Wang, Wei W" Subject: RE: [PATCH v34 0/4] Virtio-balloon: support free page reporting Date: Fri, 29 Jun 2018 15:55:04 +0000 Message-ID: <286AC319A985734F985F78AFA26841F7396C254C@shsmsx102.ccr.corp.intel.com> References: <1529928312-30500-1-git-send-email-wei.w.wang@intel.com> <5B35ACD5.4090800@intel.com> <4840cbb7-dd3f-7540-6a7c-13427de2f0d1@redhat.com> <5B36189E.5050204@intel.com> <34bb25eb-97f3-8a9f-8a13-401dfcf39a2c@redhat.com> In-Reply-To: <34bb25eb-97f3-8a9f-8a13-401dfcf39a2c@redhat.com> Content-Language: en-US Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable MIME-Version: 1.0 Sender: owner-linux-mm@kvack.org List-ID: To: David Hildenbrand , "virtio-dev@lists.oasis-open.org" , "linux-kernel@vger.kernel.org" , "virtualization@lists.linux-foundation.org" , "kvm@vger.kernel.org" , "linux-mm@kvack.org" , "mst@redhat.com" , "mhocko@kernel.org" , "akpm@linux-foundation.org" Cc: "torvalds@linux-foundation.org" , "pbonzini@redhat.com" , "liliang.opensource@gmail.com" , "yang.zhang.wz@gmail.com" , "quan.xu0@gmail.com" , "nilal@redhat.com" , "riel@redhat.com" , "peterx@redhat.com" , Andrea Arcangeli , Luiz Capitulino On Friday, June 29, 2018 7:54 PM, David Hildenbrand wrote: > On 29.06.2018 13:31, Wei Wang wrote: > > On 06/29/2018 03:46 PM, David Hildenbrand wrote: > >>> > >>> I'm afraid it can't. For example, when we have a guest booted, > >>> without too many memory activities. Assume the guest has 8GB free > >>> memory. The arch_free_page there won't be able to capture the 8GB > >>> free pages since there is no free() called. This results in no free p= ages > reported to host. > >> > >> So, it takes some time from when the guest boots up until the balloon > >> device was initialized and therefore page hinting can start. For that > >> period, you won't get any arch_free_page()/page hinting callbacks, cor= rect. > >> > >> However in the hypervisor, you can theoretically track which pages > >> the guest actually touched ("dirty"), so you already know "which > >> pages were never touched while booting up until virtio-balloon was > >> brought to life". These, you can directly exclude from migration. No > >> interface required. > >> > >> The remaining problem is pages that were touched ("allocated") by the > >> guest during bootup but freed again, before virtio-balloon came up. > >> One would have to measure how many pages these usually are, I would > >> say it would not be that many (because recently freed pages are > >> likely to be used again next for allocation). However, there are some > >> pages not being reported. > >> > >> During the lifetime of the guest, this should not be a problem, > >> eventually one of these pages would get allocated/freed again, so the > >> problem "solves itself over time". You are looking into the special > >> case of migrating the VM just after it has been started. But we have > >> the exact same problem also for ordinary free page hinting, so we > >> should rather solve that problem. It is not migration specific. > >> > >> If we are looking for an alternative to "problem solves itself", > >> something like "if virtio-balloon comes up, it will report all free > >> pages step by step using free page hinting, just like we would have > >> from "arch_free_pages()"". This would be the same interface we are > >> using for free page hinting - and it could even be made configurable i= n the > guest. > >> > >> The current approach we are discussing internally for details about > >> Nitesh's work ("how the magic inside arch_fee_pages() will work > >> efficiently) would allow this as far as I can see just fine. > >> > >> There would be a tiny little window between virtio-balloon comes up > >> and it has reported all free pages step by step, but that can be > >> considered a very special corner case that I would argue is not worth > >> it to be optimized. > >> > >> If I am missing something important here, sorry in advance :) > >> > > > > Probably I didn't explain that well. Please see my re-try: > > > > That work is to monitor page allocation and free activities via > > arch_alloc_pages and arch_free_pages. It has per-CPU lists to record > > the pages that are freed to the mm free list, and the per-CPU lists > > dump the recorded pages to a global list when any of them is full. > > So its own per-CPU list will only be able to get free pages when there > > is an mm free() function gets called. If we have 8GB free memory on > > the mm free list, but no application uses them and thus no mm free() > > calls are made. In that case, the arch_free_pages isn't called, and no > > free pages added to the per-CPU list, but we have 8G free memory right > > on the mm free list. > > How would you guarantee the per-CPU lists have got all the free pages > > that the mm free lists have? >=20 > As I said, if we have some mechanism that will scan the free pages (not > arch_free_page() once and report hints using the same mechanism step by > step (not your bulk interface)), this problem is solved. And as I said, t= his is > not a migration specific problem, we have the same problem in the current > page hinting RFC. These pages have to be reported. >=20 > > > > - I'm also worried about the overhead of maintaining so many per-CPU > > lists and the global list. For example, if we have applications > > frequently allocate and free 4KB pages, and each per-CPU list needs to > > implement the buddy algorithm to sort and merge neighbor pages. Today > > a server can have more than 100 CPUs, then there will be more than 100 > > per-CPU lists which need to sync to a global list under a lock, I'm > > not sure if this would scale well. >=20 > The overhead in the current RFC is definitely too high. But I consider th= is a > problem to be solved before page hinting would go upstream. And we are > discussing right now "if we have a reasonable page hinting implementation= , > why would we need your interface in addition". >=20 > > > > - This seems to be a burden imposed on the core mm memory > > allocation/free path. The whole overhead needs to be carried during > > the whole system life cycle. What we actually expected is to just make > > one call to get the free page hints only when live migration happens. >=20 > You're focusing too much on the actual implementation of the page hinting > RFC right now. Assume for now that we would have > - efficient page hinting without degrading other CPUs and little > overhead > - a mechanism that solves reporting free pages once after we started up > virtio-balloon and actual free page hinting starts >=20 > Why would your suggestion still be applicable? >=20 > Your point for now is "I might not want to have page hinting enabled due = to > the overhead, but still a live migration speedup". If that overhead actua= lly > exists (we'll have to see) or there might be another reason to disable pa= ge > hinting, then we have to decide if that specific setup is worth it mergin= g your > changes. All the above "if we have", "assume we have" don't sound like a valid argum= ent to me. =20 > I am not (and don't want to be) in the position to make any decisions her= e :) I > just want to understand if two interfaces for free pages actually make se= nse. I responded to Nitesh about the differences, you may want to check with him= about this. I would suggest you to send out your patches to LKML to get a discussion wi= th the mm folks. Best, Wei