From: yalin wang
Date: Fri, 8 May 2015 16:29:27 +0800
Subject: Re: Re: [EDT] oom_killer: find bulkiest task based on pss value
In-Reply-To: <2137546797.259031431072097606.JavaMail.weblogic@epmlwas09d>
To: yn.gaur@samsung.com, linux-mm@kvack.org
Cc: "akpm@linux-foundation.org", "linux-kernel@vger.kernel.org", AJEET YADAV, Amit Arora

2015-05-08 16:01 GMT+08:00 Yogesh Narayan Gaur :
>
> ------- Original Message -------
> Sender : yalin wang
> Date : May 08, 2015 13:17 (GMT+05:30)
> Title : Re: [EDT] oom_killer: find bulkiest task based on pss value
>
> 2015-05-08 13:29 GMT+08:00 Yogesh Narayan Gaur :
>>>
>>> Hi Andrew,
>>>
>>> Presently, oom_kill.c calculates the badness score of a victim task from the task's current RSS counter value. The RSS counter value of a task is essentially [Private (Dirty/Clean)] + [Shared (Dirty/Clean)].
>>> We have encountered a situation where the Private values are small but the Shared values are large, which makes the total RSS counter value large. In an OOM situation the task with the highest RSS value gets killed, but because its Private portion is small, the memory gained by killing that process falls short of expectations.
>>>
>>> For example, take the use-case scenario below, in which three processes are running in the system. Each process mmaps a file in the current directory and then copies data from the file into locally allocated buffers in a while(1) loop with some sleep. Two of the processes map the file with MAP_SHARED and one maps it with MAP_PRIVATE.
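The test sources are not included in the thread; a minimal sketch of what one of these programs could look like (the file name "datafile", the protection flags, and the sleep interval are assumptions) is:

/*
 * Hypothetical reconstruction of one test program: mmap a data file
 * (MAP_SHARED here, as in 1prg/2prg; 3prg would pass MAP_PRIVATE) and
 * repeatedly copy from the mapping into a private heap buffer.
 */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void)
{
        struct stat st;
        int fd = open("datafile", O_RDONLY);    /* file name assumed */

        if (fd < 0 || fstat(fd, &st) < 0)
                return 1;

        /* file-backed pages, accounted as Shared once >1 task maps them */
        char *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
        if (map == MAP_FAILED)
                return 1;

        /* anonymous buffer, accounted as Private in smaps */
        char *buf = malloc(st.st_size);
        if (!buf)
                return 1;

        while (1) {
                memcpy(buf, map, st.st_size);   /* keep both mappings resident */
                sleep(1);
        }
}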
>>> I ran all 3 processes in the background and checked the RSS/PSS values from a userspace utility (a wrapper over cat /proc/pid/smaps).
>>> Before OOM, below is the consumed-memory status for these 3 processes (all processes run with oom_score_adj = 0):
>>> ====================================================
>>> Comm : 1prg, Pid : 213 (values in kB)
>>>                 Rss       Shared    Private   Pss
>>> Process :       375764    194596    181168    278460
>>> ====================================================
>>> Comm : 3prg, Pid : 217 (values in kB)
>>>                 Rss       Shared    Private   Pss
>>> Process :       305760    32        305728    305738
>>> ====================================================
>>> Comm : 2prg, Pid : 218 (values in kB)
>>>                 Rss       Shared    Private   Pss
>>> Process :       389980    194596    195384    292676
>>> ====================================================
>>>
>>> Thus, as per the present code design, it would first select process [2prg : 218] to kill as the bulkiest process, since its RSS value is the highest. But if we kill this process, only ~195 MB would be freed, as compared to the expected ~389 MB.
>>> Identifying the task based on its RSS value is therefore not an accurate design, and killing the identified process doesn't release the expected memory back to the system.
>>>
>>> We need to select the victim task based on PSS instead of RSS, as the PSS value is calculated as
>>> PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of tasks sharing the pages]
>>> (For instance, for 1prg above: 181168 + 194596/2 = 278466 kB, which closely matches its reported Pss of 278460 kB.)
>>> For the above use-case scenario it can likewise be checked that process [3prg : 217] has the largest PSS value, and by killing this process we gain the maximum memory (~305 MB), as compared to killing the process identified by its RSS value.
>>>
>>> --
>>> Regards,
>>> Yogesh Gaur.
>
>> Great,
>>
>> In fact, I have also encountered this scenario.
>> I use USS (pages with map count == 1) to decide which process should be killed, and it seems to give the same result as your use of PSS. But PSS is better: it also considers shared pages, for the case where a process has a large shared-page mapping but only a small private-page mapping.
>>
>> BRs,
>> Yalin
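For reference, here is a minimal sketch of such a userspace check (an assumed stand-in, not the actual utility mentioned above) that sums the Rss: and Pss: lines of /proc/<pid>/smaps into per-process totals:

/*
 * Sum the per-VMA Rss: and Pss: lines of /proc/<pid>/smaps; with no
 * argument it reads /proc/self/smaps.
 */
#include <stdio.h>

int main(int argc, char **argv)
{
        char path[64], line[256];
        long kb, rss = 0, pss = 0;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%s/smaps",
                 argc > 1 ? argv[1] : "self");
        f = fopen(path, "r");
        if (!f)
                return 1;

        while (fgets(line, sizeof(line), f)) {
                if (sscanf(line, "Rss: %ld kB", &kb) == 1)
                        rss += kb;
                else if (sscanf(line, "Pss: %ld kB", &kb) == 1)
                        pss += kb;
        }
        fclose(f);

        printf("Rss: %ld kB  Pss: %ld kB\n", rss, pss);
        return 0;
}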
> I have made a patch that identifies the bulkiest task on the basis of its PSS value. Please check the patch below.
> This patch corrects the way the victim task gets identified in an OOM condition.
>
> ==================
>
> From 1c3d7f552f696bdbc0126c8e23beabedbd80e423 Mon Sep 17 00:00:00 2001
> From: Yogesh Gaur
> Date: Thu, 7 May 2015 01:52:13 +0530
> Subject: [PATCH] oom: find victim task based on pss
>
> This patch identifies the bulkiest task for the OOM killer on the basis of its
> PSS value instead of the present RSS value.
> There can be a scenario where the task with the highest RSS counter is consuming
> a lot of shared memory, and killing that task doesn't release the expected amount
> of memory to the system.
> PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of sharing tasks]
> RSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean)]
> Thus, use the PSS value instead of the RSS value, as the PSS value closely matches
> the actual memory usage of the task.
> This patch uses the smaps_pte_range() interface defined under CONFIG_PROC_PAGE_MONITOR.
> For the case when CONFIG_PROC_PAGE_MONITOR is disabled, it simply returns the RSS
> value count.
>
> Signed-off-by: Yogesh Gaur
> Signed-off-by: Amit Arora
> Reviewed-by: Ajeet Yadav
> ---
>  fs/proc/task_mmu.c | 47 +++++++++++++++++++++++++++++++++++++++++++++++
>  include/linux/mm.h |  9 +++++++++
>  mm/oom_kill.c      |  9 +++++++--
>  3 files changed, 63 insertions(+), 2 deletions(-)
>
> diff --git a/fs/proc/task_mmu.c b/fs/proc/task_mmu.c
> index 956b75d..dd962ff 100644
> --- a/fs/proc/task_mmu.c
> +++ b/fs/proc/task_mmu.c
> @@ -964,6 +964,53 @@ struct pagemapread {
>         bool v2;
>  };
>
> +/**
> + * get_mm_pss - determine the PSS page count of a process
> + * @p: task whose PSS should be calculated
> + * @mm: mm struct of the task
> + *
> + * PSS value = [Private (Dirty/Clean)] + [Shared (Dirty/Clean) / no. of sharing tasks]
> + *
> + * This function needs to be called under task_lock for the calling task 'p'.
> + */
> +long get_mm_pss(struct task_struct *p, struct mm_struct *mm)
> +{
> +       long pss = 0;
> +       struct vm_area_struct *vma = NULL;
> +       struct mem_size_stats mss;
> +       struct mm_walk smaps_walk = {
> +               .pmd_entry = smaps_pte_range,
> +               .private = &mss,
> +       };
> +
> +       if (mm == NULL)
> +               return 0;
> +
> +       /* task_lock held in oom_badness */
> +       smaps_walk.mm = mm;
> +
> +       if (!down_read_trylock(&mm->mmap_sem)) {
> +               pr_warn("Skipping task:%s\n", p->comm);
> +               return 0;
> +       }
> +
> +       vma = mm->mmap;
> +       if (!vma) {
> +               up_read(&mm->mmap_sem);
> +               return 0;
> +       }
> +
> +       while (vma) {
> +               memset(&mss, 0, sizeof(struct mem_size_stats));
> +               walk_page_vma(vma, &smaps_walk);
> +               pss += (long)(mss.pss >> (PAGE_SHIFT + PSS_SHIFT)); /* PSS in pages */
> +
> +               /* Check next vma in list */
> +               vma = vma->vm_next;
> +       }
> +       up_read(&mm->mmap_sem);
> +       return pss;
> +}
> +
>  #define PAGEMAP_WALK_SIZE      (PMD_SIZE)
>  #define PAGEMAP_WALK_MASK      (PMD_MASK)
>
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 47a9392..b6bb521 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -1423,6 +1423,15 @@ static inline void setmax_mm_hiwater_rss(unsigned long *maxrss,
>                 *maxrss = hiwater_rss;
>  }
>
> +#ifdef CONFIG_PROC_PAGE_MONITOR
> +long get_mm_pss(struct task_struct *p, struct mm_struct *mm);
> +#else
> +static inline long get_mm_pss(struct task_struct *p, struct mm_struct *mm)
> +{
> +       return 0;
> +}
> +#endif
> +
>  #if defined(SPLIT_RSS_COUNTING)
>  void sync_mm_rss(struct mm_struct *mm);
>  #else
> diff --git a/mm/oom_kill.c b/mm/oom_kill.c
> index 642f38c..537eb4c 100644
> --- a/mm/oom_kill.c
> +++ b/mm/oom_kill.c
> @@ -151,6 +151,7 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>  {
>         long points;
>         long adj;
> +       long pss = 0;
>
>         if (oom_unkillable_task(p, memcg, nodemask))
>                 return 0;
> @@ -167,9 +168,13 @@ unsigned long oom_badness(struct task_struct *p, struct mem_cgroup *memcg,
>
>         /*
>          * The baseline for the badness score is the proportion of RAM that each
> -        * task's rss, pagetable and swap space use.
> +        * task's pss, pagetable and swap space use.
>          */
> -       points = get_mm_rss(p->mm) + get_mm_counter(p->mm, MM_SWAPENTS) +
> +       pss = get_mm_pss(p, p->mm);
> +       if (pss == 0)   /* make pss equal to rss, pseudo-pss */
> +               pss = get_mm_rss(p->mm);
> +
> +       points = pss + get_mm_counter(p->mm, MM_SWAPENTS) +
>                 atomic_long_read(&p->mm->nr_ptes) + mm_nr_pmds(p->mm);
>         task_unlock(p);
>
> --
> 1.7.1
>
>
> --
> BRs
> Yogesh Gaur.
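A side note on the unit conversion in get_mm_pss(): the smaps code accumulates mss.pss in fixed point, with each mapped page contributing (PAGE_SIZE << PSS_SHIFT) / mapcount (PSS_SHIFT is 12 in fs/proc/task_mmu.c), so shifting the sum right by PAGE_SHIFT + PSS_SHIFT converts it back to whole pages. A standalone userspace illustration of that arithmetic (4 KiB pages assumed):

/*
 * Illustrates the fixed-point PSS accounting: one private page plus
 * pages shared by 2 and by 4 tasks sum to 1.75 pages, which the final
 * shift truncates to 1 page.
 */
#include <stdio.h>

#define PAGE_SHIFT 12   /* assumption: 4 KiB pages */
#define PAGE_SIZE  (1UL << PAGE_SHIFT)
#define PSS_SHIFT  12   /* as defined in fs/proc/task_mmu.c */

int main(void)
{
        unsigned long long pss = 0;

        pss += ((unsigned long long)PAGE_SIZE << PSS_SHIFT) / 1; /* private page */
        pss += ((unsigned long long)PAGE_SIZE << PSS_SHIFT) / 2; /* shared by 2 */
        pss += ((unsigned long long)PAGE_SIZE << PSS_SHIFT) / 4; /* shared by 4 */

        /* 1 + 0.5 + 0.25 pages, truncated to 1 whole page */
        printf("pss = %llu pages\n", pss >> (PAGE_SHIFT + PSS_SHIFT));
        return 0;
}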
The logic seems OK, but I feel the code footprint is much larger than with the original method, and it may have performance problems, since oom_badness() would now walk the page tables of every candidate task. I have added the linux-mm mailing list so the mm experts can review.

BRs
Yalin.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to majordomo@kvack.org. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: email@kvack.org