Subject: Re: [PATCH] swap: Try to scan more free slots even when fragmented
To: Huang Ying, Andrew Morton
Cc: linux-mm@kvack.org, linux-kernel@vger.kernel.org, Dave Hansen, Michal Hocko, Minchan Kim, Hugh Dickins
References: <20200427030023.264780-1-ying.huang@intel.com>
From: Tim Chen
Message-ID: <9be88e63-9fc3-108a-4b2e-5185b1938d36@linux.intel.com>
Date: Mon, 27 Apr 2020 14:18:16 -0700
In-Reply-To: <20200427030023.264780-1-ying.huang@intel.com>

On 4/26/20 8:00 PM, Huang Ying wrote:
> Now, the scalability of the swap code drops sharply when the swap
> device becomes fragmented, because the swap slot allocation batching
> stops working.  To solve the problem, this patch tries to scan a few
> more swap slots, with strictly restricted effort, so that swap slot
> allocation can still be batched even when the swap device is
> fragmented.  Tests show that the benchmark score can increase by up
> to 37.1% with the patch.  Details are as follows.
>
> The swap code has a per-CPU cache of swap slots.  This batches swap
> space allocations to improve the scalability of the swap subsystem.
> In the following code path,
>
>   add_to_swap()
>     get_swap_page()
>       refill_swap_slots_cache()
>         get_swap_pages()
>           scan_swap_map_slots()
>
> scan_swap_map_slots() and get_swap_pages() can return multiple swap
> slots per call.  These slots are cached in the per-CPU swap slots
> cache, so that several subsequent swap slot requests can be fulfilled
> there, avoiding lock contention in the lower-level swap space
> allocation/freeing code path.
>
> But this only works when there are free swap clusters.  If a swap
> device becomes so fragmented that there are no free swap clusters,
> scan_swap_map_slots() and get_swap_pages() will return only one swap
> slot per call in the above code path.  Effectively, this falls back
> to the situation before the swap slots cache was introduced: the
> heavy contention on the swap-related locks kills the scalability.
>
> Why does it work this way?  Because the swap device could be large
> and scanning for free swap slots could be quite time consuming, a
> conservative method was used to avoid spending too much time scanning
> for free swap slots.
>
> In fact, this can be improved by scanning a few more free slots with
> strictly restricted effort, which is what this patch implements.  In
> scan_swap_map_slots(), after the first free swap slot is found, we
> try to scan a little further, but only if we have not already scanned
> too many slots (< LATENCY_LIMIT).  That is, the added scanning
> latency is strictly bounded.
>
> To test the patch, we ran the 16-process pmbench memory benchmark on
> a 2-socket server with 48 cores.  Multiple RAM disks are configured
> as the swap devices.  The pmbench working-set size is much larger
> than the available memory, so that swapping is triggered.  The memory
> read/write ratio is 80/20 and the access pattern is random, so the
> swap space becomes highly fragmented during the test.  In the
> original implementation, the contention on the swap-related locks is
> very heavy.  The perf profiling data for the lock contention code
> paths is as follows,
>
>   _raw_spin_lock.get_swap_pages.get_swap_page.add_to_swap:            21.03
>   _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:   1.92
>   _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:     1.72
>   _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:      0.69
>
> After applying this patch, it becomes,
>
>   _raw_spin_lock_irq.shrink_inactive_list.shrink_lruvec.shrink_node:   4.89
>   _raw_spin_lock_irq.shrink_active_list.shrink_lruvec.shrink_node:     3.85
>   _raw_spin_lock.free_pcppages_bulk.drain_pages_zone.drain_pages:      1.1
>   _raw_spin_lock_irqsave.pagevec_lru_move_fn.__lru_cache_add.do_swap_page: 0.88
>
> That is, the lock contention on the swap locks has been eliminated.
>
> And the pmbench score increases by 37.1%.  The swapin throughput
> increases by 45.7%, from 2.02 GB/s to 2.94 GB/s, while the swapout
> throughput increases by 45.3%, from 2.04 GB/s to 2.97 GB/s.
>
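
The core idea above (after the first free slot is found, keep scanning
for more slots to batch, but stop once the LATENCY_LIMIT scan budget is
used up) can be modelled with a small standalone toy program.  To be
clear, this is only an illustrative sketch and not the kernel code: the
slot array, scan_slots() and the constants below are invented for the
example; only the role played by the LATENCY_LIMIT budget mirrors the
real scan_swap_map_slots() logic.

/*
 * Illustrative toy model only, not kernel code: after the first free
 * slot is found, keep scanning for more slots to batch, but never
 * examine more than LATENCY_LIMIT entries, so the extra latency per
 * call stays strictly bounded.
 */
#include <stdio.h>

#define NR_SLOTS	64
#define LATENCY_LIMIT	16	/* scan budget, stands in for the kernel constant */

static int slot_free[NR_SLOTS];	/* 1 = free, 0 = in use */

static int scan_slots(int start, int want, int *out)
{
	int scanned = 0, found = 0, i = start;

	while (found < want && scanned < LATENCY_LIMIT && i < NR_SLOTS) {
		scanned++;
		if (slot_free[i]) {
			out[found++] = i;
			slot_free[i] = 0;	/* allocate the slot */
		}
		i++;
	}
	/* Without the budget-limited extra scan, a fragmented device
	 * would make this return right after the first hit (found == 1). */
	return found;
}

int main(void)
{
	int batch[8], n, i;

	/* Fragmented layout: only every 5th slot is free. */
	for (i = 0; i < NR_SLOTS; i += 5)
		slot_free[i] = 1;

	n = scan_slots(0, 8, batch);
	printf("one call returned %d slots:", n);
	for (i = 0; i < n; i++)
		printf(" %d", batch[i]);
	printf("\n");
	return 0;
}

With the bounded extra scan, a single call can hand back several slots
to refill the per-CPU cache instead of exactly one, which is what
restores allocation batching on a fragmented device.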

Thanks.

Acked-by: Tim Chen

Tim

> Signed-off-by: "Huang, Ying"
> Cc: Dave Hansen
> Cc: Michal Hocko
> Cc: Minchan Kim
> Cc: Tim Chen
> Cc: Hugh Dickins