From: "Huang, Ying" <ying.huang@intel.com>
To: David Hildenbrand
Cc: Baolin Wang, Mel Gorman, akpm@linux-foundation.org, vbabka@suse.cz,
	linux-mm@kvack.org, linux-kernel@vger.kernel.org
Subject: Re: [PATCH v2] mm: compaction: skip memory hole rapidly when isolating migratable pages
References: <770f9f61472b24b6bc89adbd71a77d9cf62bb54f.1686646361.git.baolin.wang@linux.alibaba.com>
	<20230614095501.m4porztaibchrgwx@techsingularity.net>
	<87ilbpo1d9.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<5f340d98-4ee0-35a9-58ed-943834d68042@linux.alibaba.com>
	<87fs6tfaw5.fsf@yhuang6-desk2.ccr.corp.intel.com>
	<5b5e7dd0-d60b-ca46-215c-f59947b805fe@redhat.com>
Date: Thu, 15 Jun 2023 16:38:17 +0800
In-Reply-To: <5b5e7dd0-d60b-ca46-215c-f59947b805fe@redhat.com> (David Hildenbrand's message of "Thu, 15 Jun 2023 09:46:53 +0200")
Message-ID: <87bkhhf7d2.fsf@yhuang6-desk2.ccr.corp.intel.com>
User-Agent: Gnus/5.13 (Gnus v5.13) Emacs/28.2 (gnu/linux)
David Hildenbrand writes:

> On 15.06.23 09:22, Huang, Ying wrote:
>> Baolin Wang writes:
>>
>>> On 6/15/2023 11:22 AM, Huang, Ying wrote:
>>>> Hi, Mel,
>>>>
>>>> Mel Gorman writes:
>>>>
>>>>> On Tue, Jun 13, 2023 at 04:55:04PM +0800, Baolin Wang wrote:
>>>>>> On some machines, the normal zone can have a large memory hole,
>>>>>> as in the memory layout below, where the range from 0x100000000 to
>>>>>> 0x1800000000 is a hole.
>>>>>> So when isolating migratable pages, the scanner can meet the hole
>>>>>> and will take more time to skip it. From my measurement, the
>>>>>> isolation scanner takes 80us ~ 100us to skip the large hole
>>>>>> [0x100000000 - 0x1800000000].
>>>>>>
>>>>>> So add a new helper to quickly search for the next online memory
>>>>>> section, so that the large hole can be skipped and the next
>>>>>> suitable pageblock found efficiently. With this patch, scanning
>>>>>> the large hole only takes < 1us.
>>>>>>
>>>>>> [ 0.000000] Zone ranges:
>>>>>> [ 0.000000]   DMA      [mem 0x0000000040000000-0x00000000ffffffff]
>>>>>> [ 0.000000]   DMA32    empty
>>>>>> [ 0.000000]   Normal   [mem 0x0000000100000000-0x0000001fa7ffffff]
>>>>>> [ 0.000000] Movable zone start for each node
>>>>>> [ 0.000000] Early memory node ranges
>>>>>> [ 0.000000]   node   0: [mem 0x0000000040000000-0x0000000fffffffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001800000000-0x0000001fa3c7ffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa3c80000-0x0000001fa3ffffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa4000000-0x0000001fa402ffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa4030000-0x0000001fa40effff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa40f0000-0x0000001fa73cffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa73d0000-0x0000001fa745ffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa7460000-0x0000001fa746ffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa7470000-0x0000001fa758ffff]
>>>>>> [ 0.000000]   node   0: [mem 0x0000001fa7590000-0x0000001fa7ffffff]
>>>>>>
>>>>>> Signed-off-by: Baolin Wang
>>>>>
>>>>> This may only be necessary for non-contiguous zones, so a check for
>>>>> zone_contiguous could be made, but I suspect the saving, if any,
>>>>> would be marginal.
>>>>>
>>>>> However, it's subtle that block_end_pfn can end up in an arbitrary
>>>>> location past the end of the zone or past cc->free_pfn. As the
>>>>> "continue" will update cc->migrate_pfn, that might lead to errors
>>>>> in the future.
>>>>> It would be a lot safer to pass in cc->free_pfn and do two things
>>>>> with the value. First, there is no point scanning for a valid
>>>>> online section past cc->free_pfn, so terminating after cc->free_pfn
>>>>> may save some cycles. Second, cc->migrate_pfn does not end up with
>>>>> an arbitrary value, which is a more defensive approach to any
>>>>> future programming errors.
>>>>
>>>> I have thought about this before. Originally, I thought that we were
>>>> safe because cc->free_pfn should be in an online section and
>>>> block_end_pfn should reach cc->free_pfn before the end of the zone.
>>>> But after checking more code and thinking about it again, I found
>>>> that the underlying sections may go offline under us during
>>>> compaction, so cc->free_pfn may be in an offline section or past the
>>>> end of the zone. So you are right, we need to consider the range of
>>>> block_end_pfn.
>>>>
>>>> But if we think in this way (memory may go online/offline at any
>>>> time), it appears that we need to check whether the underlying
>>>> section has been offlined. For example, is it safe to use
>>>> pfn_to_page() in isolate_migratepages_block()? Is it possible for
>>>> the underlying section to be offlined under us?
>>>
>>> It is possible. There is a previous discussion [1] about the race
>>> between pfn_to_online_page() and memory offline.
>>>
>>> [1] https://lore.kernel.org/lkml/87zgc6buoq.fsf@nvidia.com/T/#m642d91bcc726437e1848b295bc57ce249c7ca399
>>
>> Thank you very much for sharing! That answers my questions directly!
>
> I remember another discussion (but can't find it) regarding why memory
> compaction can get away without pfn_to_online_page() all over the
> place. The use is limited to __reset_isolation_pfn().

Per my understanding, isolate_migratepages() -> pageblock_pfn_to_page()
will check whether the pageblock is online. So if the pageblock isn't
offlined afterwards, we can use pfn_to_page().

> But yes, in theory pfn_to_online_page() can race with memory offlining.
Thanks for confirmation.

Best Regards,
Huang, Ying
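[Editor's note: for readers outside the thread, the helper under discussion can be sketched in plain userspace C. The section size, the online[] bitmap, and all names below are illustrative stand-ins for the kernel's mem_section machinery, not the real kernel API; the clamp on free_pfn follows Mel's suggestion above.]

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Userspace sketch: instead of walking a large hole pageblock by
 * pageblock, the migration scanner jumps straight to the start of the
 * next online memory section.  128 MiB sections with 4 KiB pages.
 */
#define PFN_SECTION_SHIFT 15
#define NR_SECTIONS       1024UL

static bool online[NR_SECTIONS];      /* stand-in for section online state */

static unsigned long pfn_to_section_nr(unsigned long pfn)
{
	return pfn >> PFN_SECTION_SHIFT;
}

static unsigned long section_nr_to_pfn(unsigned long nr)
{
	return nr << PFN_SECTION_SHIFT;
}

/*
 * If @pfn sits in an offline section (a hole), return the first PFN of
 * the next online section, never searching past @free_pfn (cc->free_pfn
 * in the kernel).  Returns 0 when the hole extends to free_pfn, i.e.
 * there is nothing left for the migration scanner to do.
 */
static unsigned long skip_offline_sections(unsigned long pfn,
					   unsigned long free_pfn)
{
	unsigned long nr = pfn_to_section_nr(pfn);

	if (nr < NR_SECTIONS && online[nr])
		return pfn;              /* not in a hole, nothing to skip */

	for (nr++; nr < NR_SECTIONS; nr++) {
		unsigned long start = section_nr_to_pfn(nr);

		if (start > free_pfn)    /* no point scanning past free_pfn */
			return 0;
		if (online[nr])
			return start;
	}
	return 0;
}
```

The free_pfn clamp is what makes the helper defensive: a hole that reaches past cc->free_pfn yields 0 instead of an arbitrary PFN, so cc->migrate_pfn can never be advanced past the free scanner.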