Incrementing values in an array based on another array (generating evenly spaced values within specified intervals)

Increment and insert values based on a specific value of another array – StackOverflow

This is easy to handle with the get_ranges_arr function
(a function that behaves like arange with arbitrary start/stop points: given two numpy.ndarrays holding start and end indices, it computes each range and returns all the consecutive values concatenated into one array).

It is not a function tailored to this particular problem, but it is highly general.

import numpy as np


def get_ranges_arr(s, e, dt=1, endpoint=True, dtype=int):
    # Vectorized equivalent of concatenating np.arange(s[i], e[i], dt) per pair.
    if endpoint:
        e = e + 1
    # Cumulative lengths of the individual ranges.
    v = np.ceil((e - s)/dt).astype(dtype).cumsum(0)
    # Fill with the step, then patch each boundary element so the running
    # cumsum jumps from the end of one range to the start of the next.
    out = np.ones(v[-1], dtype=dtype) * dt
    out[0] = s[0]
    out[v[:-1]] = s[1:] - e[:-1] + 1
    return out.cumsum(0)


a = np.array([0,2,9,12,18,19])
b = np.array([1,1,1,2,1,3])
c = a + b
result = get_ranges_arr(a, c, endpoint=False)
np.testing.assert_allclose(result, np.array([0,2,9,12,13,18,19,20,21]))
result
array([ 0,  2,  9, 12, 13, 18, 19, 20, 21])
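As a readability cross-check (my addition, far less efficient), the same output is simply the concatenation of one arange per (start, stop) pair:

```python
import numpy as np

a = np.array([0, 2, 9, 12, 18, 19])
b = np.array([1, 1, 1, 2, 1, 3])

# naive reference: one arange per (start, stop) pair
ref = np.concatenate([np.arange(s, e) for s, e in zip(a, a + b)])
assert np.array_equal(ref, np.array([0, 2, 9, 12, 13, 18, 19, 20, 21]))
```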

Fetching a CSV and converting it to TSV

Remove quotes “…” from all columns with AWK – StackOverflow

The question's title, body, and expected output are all inconsistent, so let's take some liberties. The expected output shown there looks wrong, so treat the output of pandas, which ships an excellent CSV parser, as the expected output.

import pandas as pd


df = pd.read_csv('file', dtype=str)
print(df.fillna('null').to_csv(sep='\t', index=False))
visitor_date    country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55    United States   App null    692467  2163702 4117512086  format  0
05/10/2019 00:56    United States   SamsungBrowser  8   692467  2163702 4117512083  format  1
05/10/2019 01:08    United States   safari  11  692467  2163704 4117511299  format  1
05/10/2019 03:25    United States   SamsungBrowser  8   692467  2163704 4117511299  format  null
05/10/2019 07:50    United States   chrome  58  692467  2163704 4117511299  format  null
05/10/2019 09:18    United States   internet explorer   11  692467  2163702 4117512086  format  0

Doing the same thing with awk:

%%bash
time {
awk 'BEGIN{FS=",";OFS="\t"}{gsub(/""/, "null");gsub(/"/, "");$1=$1}1' file
}
visitor_date    country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55    United States   App null    692467  2163702 4117512086  format  0
05/10/2019 00:56    United States   SamsungBrowser  8   692467  2163702 4117512083  format  1
05/10/2019 01:08    United States   safari  11  692467  2163704 4117511299  format  1
05/10/2019 03:25    United States   SamsungBrowser  8   692467  2163704 4117511299  format  null
05/10/2019 07:50    United States   chrome  58  692467  2163704 4117511299  format  null
05/10/2019 09:18    United States   internet explorer   11  692467  2163702 4117512086  format  0

real    0m0.003s
user    0m0.002s
sys 0m0.001s

For the kind of case raised in the comments (by the CSV definition: comma-separated fields, with a field wrapped in double quotes when it contains a comma):

import io
import pandas as pd


strings = """A,,C,"D,E,F","G",I,"K,L,M",Z"""
df = pd.read_csv(io.StringIO(strings), header=None)
df
    0   1   2   3   4   5   6   7
0   A   NaN C   D,E,F   G   I   K,L,M   Z

With mawk you would have to build this from scratch, but gawk provides the FPAT variable.

4.7 Defining Fields by Content – The GNU Awk User’s Guide
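As a cross-check (my addition), the FPAT idea — fields defined by content, so commas inside quotes are left alone — can be mimicked in Python with a regex that splits only on commas outside quoted sections:

```python
import re

line = 'A,,C,"D,E,F","G",I,"K,L,M",Z'
# split only on commas followed by an even number of double quotes,
# i.e. commas that sit outside any quoted field
fields = re.split(r',(?=(?:[^"]*"[^"]*")*[^"]*$)', line)
assert fields == ['A', '', 'C', '"D,E,F"', '"G"', 'I', '"K,L,M"', 'Z']
```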

For regularly structured data such as CSV, processing with tr or sed is not reasonable (unless the case is trivial enough to allow very strong assumptions). Rather than ignoring the structure and treating the data as plain text, it should be processed according to its fixed structure — which is exactly why, as the comments say, a CSV parser or gawk should be used.

%%writefile test.py
#!/usr/bin/env python
import sys
import csv


reader = csv.reader(sys.stdin)
writer = csv.writer(sys.stdout, 'excel-tab')
writer.writerows(reader)
%%writefile test2.py
#!/usr/bin/env python
import sys
import csv


with open(sys.argv[1], 'r') as csvfile:
    reader = list(csv.reader(csvfile))

with open('out.tsv', 'w') as csvfile:
    writer = csv.writer(csvfile, 'excel-tab')
    writer.writerows(reader)
%%bash
time {
cat file | python test.py
}
visitor_date    country browser browser_version id1 id2 id3 id4 true or false column
05/10/2019 00:55    United States   App     692467  2163702 4117512086  format  0
05/10/2019 00:56    United States   SamsungBrowser  8   692467  2163702 4117512083  format  1
05/10/2019 01:08    United States   safari  11  692467  2163704 4117511299  format  1
05/10/2019 03:25    United States   SamsungBrowser  8   692467  2163704 4117511299  format  
05/10/2019 07:50    United States   chrome  58  692467  2163704 4117511299  format  
05/10/2019 09:18    United States   internet explorer   11  692467  2163702 4117512086  format  0

real    0m0.059s
user    0m0.042s
sys 0m0.021s

Factoring a symbolic equation

Factor out a symbolic equation algorithm – StackOverflow

For learning purposes one should probably start from abstract algebra, but practically speaking, reading Sympy's implementation is easier and likely more rewarding.

import sympy
from sympy.abc import a, b, c


f = a*b + a*c + b**2 + b*c
sympy.factor(f)
(a + b)*(b + c)
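A quick round-trip check (my addition): expanding the factored form should recover the original polynomial.

```python
import sympy
from sympy.abc import a, b, c

f = a*b + a*c + b**2 + b*c
factored = sympy.factor(f)
# expanding the product cancels against the original expression
assert sympy.expand(factored) - f == 0
```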

Removing all quotes in a specific CSV column

Using SED or AWK to remove all quotes in a specific CSV column – StackOverflow

There is no generality here whatsoever. Still, since it is CSV — a fixed number of columns, each column some facet — the input can be assumed to follow a certain shape, so with enough information a regex can be written. Here, assuming a fixed format:

import re


def reformated(strings, pat=None):
    def _f(x):
        # drop every quote inside the middle group, then re-quote it
        return x.group(1) + '"' + x.group(2).replace('"', '') + '"' + x.group(3)

    if pat is None:
        # assumed fixed shape: quoted/numeric leading cells, the free-text cell,
        # then a quoted ISO timestamp and a quoted word
        pat = r'(["\d]+,["\d]+,.*?,\d+,)(.*?)(,"\d{4}-\d{2}-\d{2}.*?","\w+",)'
    return re.sub(pat, _f, strings)


strings = """"123","456",,17,"hello," how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ", 
"345","737",,16,"Heading to a "meeting", unprepared while trying to be "awake","2018-05-29T18:58:10-05:00","ACD","""

print(reformated(strings))
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ", 
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",

Now, this is probably difficult in sed, but not that hard in awk. Either way, the question is what assumptions to make. If we take the string field as the unstable part and assume the number of cells before and after it is fixed:

%%bash
time {
awk -F',' '{s=index($0,$5);e=index($0,$(NF-2))-1;a=substr($0, s, e-s);gsub(/"/, "", a);print substr($0,0,s)"\""a"\""substr($0,e)}' file
}
"123","456",,17,"hello, how are you this, fine, highly caffienated morning,","2018-05-29T18:58:10-05:00","XYZ", 
"345","737",,16,"Heading to a meeting, unprepared while trying to be awake","2018-05-29T18:58:10-05:00","ACD",

real    0m0.003s
user    0m0.001s
sys 0m0.002s

Combinatorial optimization – finding the combination from a given set whose sum is closest to the target value

Find nearest value to a target with a given set of values where repetition is allowed – StackOverflow

!pip install -q pulp
import pulp
 

def soln(S, t, printing=True):
    # one integer count per value in S (repetition allowed)
    var = [pulp.LpVariable(f'var{i}', 0, cat='Integer') for i in range(len(S))]
    p = pulp.LpProblem('The greatest sum, equal or smaller to t', pulp.LpMaximize)
    # maximize the total; subtracting the item count nudges ties toward fewer items
    p += pulp.lpDot(var, S) - pulp.lpSum(var)
    p += pulp.lpDot(var, S) <= t
    ss = p.solve()
    print(p)
    print(f'Status: {pulp.LpStatus[ss]}', end='\n\n')
    if printing:
        res = [x.varValue for x in var]
        print('*** Result ***')
        # add the counts back, since they were subtracted in the objective
        print(f'Total: {p.objective.value()+sum(res)}')
        print(res, end='\n\n')
    else:
        return p


Slst = ([500, 1000, 2000, 5000], [500, 1000, 2000, 5000], [222,333])
tlst = (7000, 7990, 444)
for S, t in zip(Slst, tlst):
    print(f'Given values: {S}, Target: {t}', end='\n\n')
    soln(S, t)
Given values: [500, 1000, 2000, 5000], Target: 7000

The greatest sum, equal or smaller to t:
MAXIMIZE
499*var0 + 999*var1 + 1999*var2 + 4999*var3 + 0
SUBJECT TO
_C1: 500 var0 + 1000 var1 + 2000 var2 + 5000 var3 <= 7000

VARIABLES
0 <= var0 Integer
0 <= var1 Integer
0 <= var2 Integer
0 <= var3 Integer

Status: Optimal

*** Result ***
Total: 7000.0
[0.0, 0.0, 1.0, 1.0]

Given values: [500, 1000, 2000, 5000], Target: 7990

The greatest sum, equal or smaller to t:
MAXIMIZE
499*var0 + 999*var1 + 1999*var2 + 4999*var3 + 0
SUBJECT TO
_C1: 500 var0 + 1000 var1 + 2000 var2 + 5000 var3 <= 7990

VARIABLES
0 <= var0 Integer
0 <= var1 Integer
0 <= var2 Integer
0 <= var3 Integer

Status: Optimal

*** Result ***
Total: 7500.0
[1.0, 0.0, 1.0, 1.0]

Given values: [222, 333], Target: 444

The greatest sum, equal or smaller to t:
MAXIMIZE
221*var0 + 332*var1 + 0
SUBJECT TO
_C1: 222 var0 + 333 var1 <= 444

VARIABLES
0 <= var0 Integer
0 <= var1 Integer

Status: Optimal

*** Result ***
Total: 444.0
[2.0, 0.0]
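As a non-LP cross-check (my addition), a small unbounded-knapsack DP finds the same greatest attainable sum not exceeding the target:

```python
def best_sum_le(S, t):
    # reachable[v] is True if some multiset drawn from S sums exactly to v
    reachable = [True] + [False] * t
    for v in range(1, t + 1):
        reachable[v] = any(v >= s and reachable[v - s] for s in S)
    # greatest reachable sum not exceeding the target
    return max(v for v in range(t + 1) if reachable[v])


assert best_sum_le([500, 1000, 2000, 5000], 7000) == 7000
assert best_sum_le([500, 1000, 2000, 5000], 7990) == 7500
assert best_sum_le([222, 333], 444) == 444
```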

Removing symmetric pairs from a dataframe – keeping only row-wise unique values

Find symmetric pairs quickly in numpy – StackOverflow

For numeric data, a NumPy solution is the reasonable choice.

import pandas as pd
import numpy as np
from sklearn.utils.extmath import cartesian


def view1D(a):
    # Collapse each row into a single void scalar so rows compare as 1-D elements.
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.itemsize*a.shape[1]))
    return a.view(void_dt).ravel()


def isin_nd(a, b):
    # Row-wise membership test: which rows of a also appear in b.
    A, B = view1D(a.reshape(a.shape[0], -1)), view1D(b.reshape(b.shape[0], -1))
    return np.isin(A, B)


def drop_symmetric_pairs(arr):
    # Rows whose reversal also occurs somewhere in arr.
    m = isin_nd(arr, arr[:, ::-1])
    # Canonicalize those rows by sorting, then deduplicate
    # ('(2,)i8' assumes two int64 columns).
    v = np.unique(view1D(np.sort(arr[m], 1))).view('(2,)i8')
    out = np.concatenate((arr[~m], v))
    return out


col = ['c1', 'c2']
df = (
    pd.DataFrame(cartesian((range(10),)*2), columns=col)
    .sample(90)
    .sort_values(col)
    .reset_index(drop=True)
)

res = pd.DataFrame(drop_symmetric_pairs(df.to_numpy()))
res
    0   1
0   0   9
1   1   5
2   2   2
3   4   2
4   5   9
5   6   9
6   7   2
7   8   6
8   9   3
9   9   6
10  0   0
11  0   1
12  0   3
13  0   4
14  0   5
15  0   6
16  0   7
17  0   9
18  1   1
19  1   2
20  1   4
21  1   5
22  1   7
23  1   8
24  1   9
25  2   2
26  2   3
27  2   4
28  2   5
29  2   6
30  2   8
31  2   9
32  3   3
33  3   5
34  3   6
35  3   7
36  3   8
37  4   4
38  4   5
39  4   6
40  4   7
41  4   8
42  4   9
43  5   5
44  5   6
45  5   7
46  5   9
47  6   6
48  6   7
49  7   7
50  7   8
51  7   9
52  8   8
53  8   9
54  9   9
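To make the trick explicit (my addition, restating view1D from the cell above): packing each row's bytes into one void scalar lets identical rows compare equal as plain 1-D elements, which is what isin_nd relies on.

```python
import numpy as np


def view1D(a):
    # pack each row's bytes into one void scalar
    a = np.ascontiguousarray(a)
    void_dt = np.dtype((np.void, a.itemsize * a.shape[1]))
    return a.view(void_dt).ravel()


rows = np.array([[1, 2], [3, 4], [1, 2]])
v = view1D(rows)
assert v[0] == v[2]  # identical rows -> identical scalars
assert v[0] != v[1]
```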


def drop_symmetric_pairs2(arr):
    a = np.sort(arr, 1)
    out = np.unique(view1D(a)).view('(2,)i8')
    return out


df2 = (
    pd.DataFrame(cartesian((range(10000),)*2), columns="c1 c2".split())
    .sample(10000)
    .sort_values(df.columns.tolist())
    .reset_index(drop=True)
)

res1 = drop_symmetric_pairs(df2.to_numpy())
res2 = drop_symmetric_pairs2(df2.to_numpy())
assert isin_nd(np.sort(res1, 1), np.sort(res2, 1)).all()
%timeit res1 = pd.DataFrame(drop_symmetric_pairs(df2.to_numpy()))
%timeit res2 = pd.DataFrame(drop_symmetric_pairs2(df2.to_numpy()))
100 loops, best of 3: 5.64 ms per loop
100 loops, best of 3: 2.67 ms per loop

Efficiency gets considerably worse (10–20x slower), but a Pandas solution can be written more simply.

res = df.apply(np.sort, axis=1, raw=True).drop_duplicates()
res
    c1  c2
0   0   0
1   0   1
2   0   4
3   0   5
4   0   6
5   0   7
6   0   8
7   0   9
9   1   1
10  1   2
11  1   3
12  1   5
13  1   7
14  1   8
15  1   9
16  0   2
18  2   2
19  2   3
20  2   4
21  2   5
22  2   6
23  2   7
24  2   9
25  0   3
28  3   3
29  3   4
30  3   5
31  3   6
32  3   7
33  3   8
34  3   9
36  1   4
39  4   4
40  4   5
41  4   6
42  4   7
43  4   8
44  4   9
50  5   5
51  5   6
52  5   7
53  5   8
54  5   9
58  6   6
59  6   7
60  6   8
61  6   9
68  7   7
69  7   8
70  7   9
73  2   8
78  8   8
79  8   9
89  9   9

For non-numeric data, it is better to do it in pure Python or to go with the Pandas solution.

df2[~df2.apply(frozenset, 1).duplicated()]

For numeric data, though, the efficiency gap is night and day, so there is no reason not to choose the NumPy solution.

df2 = (
    pd.DataFrame(cartesian((range(10000),)*2), columns="c1 c2".split())
    .sample(10000)
    .sort_values(df.columns.tolist())
    .reset_index(drop=True)
)

res1 = drop_symmetric_pairs(df2.to_numpy())
res2 = drop_symmetric_pairs2(df2.to_numpy())
res3 = df2[~df2.apply(frozenset, 1).duplicated()].to_numpy()
assert isin_nd(np.sort(res1, 1), np.sort(res2, 1)).all()
assert isin_nd(np.sort(res1, 1), np.sort(res3, 1)).all()
%timeit res1 = pd.DataFrame(drop_symmetric_pairs(df2.to_numpy()))
%timeit res2 = pd.DataFrame(drop_symmetric_pairs2(df2.to_numpy()))
%timeit res3 = df2[~df2.apply(frozenset, 1).duplicated()]
100 loops, best of 3: 5.69 ms per loop
100 loops, best of 3: 2.67 ms per loop
10 loops, best of 3: 157 ms per loop

Multiplying a matrix by every row of another matrix

how to multiply a matrix with every row in another matrix using numpy – StackOverflow

When in doubt, Einstein notation.

import numpy as np


a = np.array(
    [
        [0, 1, 1],
        [2, 2, 0],
        [3, 0, 3]
    ]
)

b = np.array(
    [
        [1, 1, 1],
        [2, 2, 2],
        [3, 2, 9],
        [4, 4, 4],
        [5, 9, 5]
    ]
)

np.einsum('ij,kj->ik', b, a) # b @ a.T
array([[ 2,  4,  6],
       [ 4,  8, 12],
       [11, 10, 36],
       [ 8, 16, 24],
       [14, 28, 30]])

It is quite inefficient, but broadcasting is easy to understand.

(a * b[:, None]).sum(-1)
array([[ 2,  4,  6],
       [ 4,  8, 12],
       [11, 10, 36],
       [ 8, 16, 24],
       [14, 28, 30]])
a = np.random.randint(0, 1000, (3000, 3))
b = np.random.randint(0, 1000, (10000, 3))
%timeit b @ a.T
%timeit np.einsum('ij,kj->ik', b, a)
%timeit np.einsum('ij,jk->ik', b, a.T)
%timeit np.einsum('ij,kij->ki', a, b[:, None])
%timeit np.einsum('ij,kj->ik', b, a, optimize=True)
%timeit np.einsum('ij,jk->ik', b, a.T, optimize=True)
%timeit np.einsum('ij,kij->ki', a, b[:, None], optimize=True)
%timeit (a * b[:, None]).sum(-1)
10 loops, best of 3: 132 ms per loop
1 loop, best of 3: 204 ms per loop
1 loop, best of 3: 192 ms per loop
10 loops, best of 3: 198 ms per loop
10 loops, best of 3: 182 ms per loop
10 loops, best of 3: 183 ms per loop
1 loop, best of 3: 206 ms per loop
1 loop, best of 3: 861 ms per loop
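A quick consistency check (my addition): the matmul, einsum, and broadcasting formulations all agree.

```python
import numpy as np

rng = np.random.default_rng(0)
a = rng.integers(0, 10, (3, 3))
b = rng.integers(0, 10, (5, 3))

r1 = b @ a.T                       # plain matmul
r2 = np.einsum('ij,kj->ik', b, a)  # Einstein notation
r3 = (a * b[:, None]).sum(-1)      # broadcasting
assert np.array_equal(r1, r2) and np.array_equal(r1, r3)
```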