任意の列を条件にしてデータフレームをグルーピングし各グループ毎にソート – Pandas

Sorting based on some threshold in python – StackOverflow

これ,質問の意味や意図は違うかもしれないけど,
本トピックでは表題の事を考えてみる.

まず,グルーピング.これは,今までの使い回しでできるので凄く簡単.
(そしておそらくこれがリーズナブルチョイス;最も時間効率が良いと思う)

データフレーム列内の連続した値の累積和-任意の範囲でcumsum

任意の条件でクロス集計を行う

 
 

import io
import pandas as pd
import numpy as np


def specified_cumsum(a, ind):
  arr = a.copy()
  arr[ind[1:]] -= np.add.reduceat(arr, ind)[:-1]
  idx = np.concatenate((ind[1:], [ind.max()+1])) - 1
  return arr.cumsum(0)


strings = """123456 0.7201741337776184 0.028222769498825073
G3/26 0.7428475022315979 0.14198169112205505
BED: 0.6399385929107666 0.14720818400382996
02/12/1979 0.7476143836975098 0.2584814727306366
DOB: 0.6413170695304871 0.26108822226524353
AGE: 0.6393641233444214 0.3781811594963074
32YSEX: 0.7450932264328003 0.379544198513031
M 0.9087993502616882 0.38598716259002686
6056 0.5140257477760315 0.490386426448822
REL: 0.6386545300483704 0.49573469161987305
HINDU 0.7446932792663574 0.4978155195713043
18/01/18 0.7429159879684448 0.6074452996253967
ADM: 0.6375628709793091 0.6129379272460938
PAY: 0.6137194633483887 0.7307364344596863
AUF 0.8062918186187744 0.7317520380020142
PVT 0.726770281791687 0.732855498790741
80611895 0.7259002923965454 0.8441673517227173
FND: 0.6144964098930359 0.847784698009491"""

df = pd.read_table(io.StringIO(strings), sep='\s+', header=None)
df.columns = 'right,rightX,rightY'.split(',')

arr = df['rightY'].diff() > 0.05
ind =np.where(np.diff(arr))[0] + 1
df.assign(**{'diff_rightY': df['rightY'].diff(), 'group': specified_cumsum(arr, ind).cumsum()})
    right       rightX      rightY      diff_rightY group
0   123456      0.720174    0.028223    NaN     False
1   G3/26       0.742848    0.141982    0.113759    1
2   BED:        0.639939    0.147208    0.005226    1
3   02/12/1979  0.747614    0.258481    0.111273    2
4   DOB:        0.641317    0.261088    0.002607    2
5   AGE:        0.639364    0.378181    0.117093    3
6   32YSEX:     0.745093    0.379544    0.001363    3
7   M       0.908799    0.385987    0.006443    3
8   6056        0.514026    0.490386    0.104399    4
9   REL:        0.638655    0.495735    0.005348    4
10  HINDU       0.744693    0.497816    0.002081    4
11  18/01/18    0.742916    0.607445    0.109630    5
12  ADM:        0.637563    0.612938    0.005493    5
13  PAY:        0.613719    0.730736    0.117799    6
14  AUF     0.806292    0.731752    0.001016    6
15  PVT     0.726770    0.732855    0.001103    6
16  80611895    0.725900    0.844167    0.111312    7
17  FND:        0.614496    0.847785    0.003617    7

後はgroupbyしてsortするだけ.

df2 = df.assign(**{'diff_rightY': df['rightY'].diff(), 'group': (specified_cumsum(arr, ind).cumsum()).astype(int)})
df2.groupby('group').apply(lambda x: x.sort_values('rightX'))
    right   rightX  rightY  diff_rightY group
group                       
0   0   123456  0.720174    0.028223    NaN 0
1   2   BED:    0.639939    0.147208    0.005226    1
1   G3/26   0.742848    0.141982    0.113759    1
2   4   DOB:    0.641317    0.261088    0.002607    2
3   02/12/1979  0.747614    0.258481    0.111273    2
3   5   AGE:    0.639364    0.378181    0.117093    3
6   32YSEX: 0.745093    0.379544    0.001363    3
7   M   0.908799    0.385987    0.006443    3
4   8   6056    0.514026    0.490386    0.104399    4
9   REL:    0.638655    0.495735    0.005348    4
10  HINDU   0.744693    0.497816    0.002081    4
5   12  ADM:    0.637563    0.612938    0.005493    5
11  18/01/18    0.742916    0.607445    0.109630    5
6   13  PAY:    0.613719    0.730736    0.117799    6
15  PVT 0.726770    0.732855    0.001103    6
14  AUF 0.806292    0.731752    0.001016    6
7   17  FND:    0.614496    0.847785    0.003617    7
16  80611895    0.725900    0.844167    0.111312    7

多分,OPが期待する出力を求めるならこう.

df2.groupby('group').apply(lambda x: x.sort_values('rightX').head(1))
        right   rightX  rightY  diff_rightY group
group                       
0   0   123456  0.720174    0.028223    NaN 0
1   2   BED:    0.639939    0.147208    0.005226    1
2   4   DOB:    0.641317    0.261088    0.002607    2
3   5   AGE:    0.639364    0.378181    0.117093    3
4   8   6056    0.514026    0.490386    0.104399    4
5   12  ADM:    0.637563    0.612938    0.005493    5
6   13  PAY:    0.613719    0.730736    0.117799    6
7   17  FND:    0.614496    0.847785    0.003617    7

 
 
勿論,時間効率を気にしなければ,Numpyを使わずPandasだけでもっと簡単にできる.

import io
import pandas as pd


strings = """123456 0.7201741337776184 0.028222769498825073
G3/26 0.7428475022315979 0.14198169112205505
BED: 0.6399385929107666 0.14720818400382996
02/12/1979 0.7476143836975098 0.2584814727306366
DOB: 0.6413170695304871 0.26108822226524353
AGE: 0.6393641233444214 0.3781811594963074
32YSEX: 0.7450932264328003 0.379544198513031
M 0.9087993502616882 0.38598716259002686
6056 0.5140257477760315 0.490386426448822
REL: 0.6386545300483704 0.49573469161987305
HINDU 0.7446932792663574 0.4978155195713043
18/01/18 0.7429159879684448 0.6074452996253967
ADM: 0.6375628709793091 0.6129379272460938
PAY: 0.6137194633483887 0.7307364344596863
AUF 0.8062918186187744 0.7317520380020142
PVT 0.726770281791687 0.732855498790741
80611895 0.7259002923965454 0.8441673517227173
FND: 0.6144964098930359 0.847784698009491"""

df = pd.read_table(io.StringIO(strings), sep='\s+', header=None)
df.columns = 'right,rightX,rightY'.split(',')

df.assign(**{'diff_rightY': df['rightY'].diff(), 'group': (df['rightY'].diff()>0.05).cumsum()})
    right   rightX  rightY  diff_rightY group
0   123456  0.720174    0.028223    NaN 0
1   G3/26   0.742848    0.141982    0.113759    1
2   BED:    0.639939    0.147208    0.005226    1
3   02/12/1979  0.747614    0.258481    0.111273    2
4   DOB:    0.641317    0.261088    0.002607    2
5   AGE:    0.639364    0.378181    0.117093    3
6   32YSEX: 0.745093    0.379544    0.001363    3
7   M   0.908799    0.385987    0.006443    3
8   6056    0.514026    0.490386    0.104399    4
9   REL:    0.638655    0.495735    0.005348    4
10  HINDU   0.744693    0.497816    0.002081    4
11  18/01/18    0.742916    0.607445    0.109630    5
12  ADM:    0.637563    0.612938    0.005493    5
13  PAY:    0.613719    0.730736    0.117799    6
14  AUF 0.806292    0.731752    0.001016    6
15  PVT 0.726770    0.732855    0.001103    6
16  80611895    0.725900    0.844167    0.111312    7
17  FND:    0.614496    0.847785    0.003617    7
df2 = df.assign(**{'diff_rightY': df['rightY'].diff(), 'group': (df['rightY'].diff()>0.05).cumsum()})
df2.groupby('group').apply(lambda x: x.sort_values('rightX').head(1))
right   rightX  rightY  diff_rightY group
group                       
0   0   123456  0.720174    0.028223    NaN 0
1   2   BED:    0.639939    0.147208    0.005226    1
2   4   DOB:    0.641317    0.261088    0.002607    2
3   5   AGE:    0.639364    0.378181    0.117093    3
4   8   6056    0.514026    0.490386    0.104399    4
5   12  ADM:    0.637563    0.612938    0.005493    5
6   13  PAY:    0.613719    0.730736    0.117799    6
7   17  FND:    0.614496    0.847785    0.003617    7
カテゴリー: 未分類 パーマリンク

任意の列を条件にしてデータフレームをグルーピングし各グループ毎にソート – Pandas への1件のフィードバック

  1. ピンバック: シーケンシャルな値の累積和(cumsum) | 粉末@それは風のように (日記)

コメントは受け付けていません。