Pandas: union duplicate rows

I have a dataframe

ID   url           date        active_seconds
111  vk.com        12.01.2016  5
111  facebook.com  12.01.2016  4
111  facebook.com  12.01.2016  3
111  twitter.com   12.01.2016  12
222  vk.com        12.01.2016  8
222  twitter.com   12.01.2016  34
111  facebook.com  12.01.2016  5

and I need to get

ID   url           date        active_seconds
111  vk.com        12.01.2016  5
111  facebook.com  12.01.2016  7
111  twitter.com   12.01.2016  12
222  vk.com        12.01.2016  8
222  twitter.com   12.01.2016  34
111  facebook.com  12.01.2016  5

If I try

df.groupby(['ID', 'url'])['active_seconds'].sum()

it merges all the matching rows, not only the adjacent ones. What should I do to get the desired result?
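For anyone who wants to reproduce the problem, here is a minimal sketch. It assumes the table above is loaded with the date kept as a plain string; the variable name totals is only illustrative.

```python
import pandas as pd

# Sample data copied from the question (date kept as a string).
df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 222, 222, 111],
    "url": ["vk.com", "facebook.com", "facebook.com", "twitter.com",
            "vk.com", "twitter.com", "facebook.com"],
    "date": ["12.01.2016"] * 7,
    "active_seconds": [5, 4, 3, 12, 8, 34, 5],
})

# Grouping by value alone puts every (111, facebook.com) row in one group,
# including the last row, which is not adjacent to the other two.
totals = df.groupby(["ID", "url"])["active_seconds"].sum()
print(totals)
```

The (111, facebook.com) entry comes out as 12 (4 + 3 + 5), where the desired result keeps 7 and 5 as separate rows.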

Source

2017-01-13 Petr Petrov

(s != s.shift()).cumsum() is a typical way to identify groups of contiguous identifiers
pd.DataFrame.assign is a convenient way to add a new column to a copy of a dataframe and chain more methods
pivot_table allows us to reshape our table and aggregate
args - this is a style preference of mine to keep the code cleaner looking. I pass these arguments to pivot_table via *args
reset_index * 2 to clean up and get to the final result

# positional arguments for pivot_table: values, index, columns, aggfunc
args = ('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum')
df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table(*args) \
    .reset_index([1, 2, 3]).reset_index(drop=True)

    ID           url        date  active_seconds
0  111  facebook.com  12.01.2016               7
1  111   twitter.com  12.01.2016              12
2  111        vk.com  12.01.2016               5
3  222   twitter.com  12.01.2016              34
4  222        vk.com  12.01.2016               8
5  111  facebook.com  12.01.2016               5
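The run-labelling idiom from the first point is easier to follow in isolation. This is a small sketch on made-up values; the names starts and run_id are illustrative only.

```python
import pandas as pd

s = pd.Series(["a", "b", "b", "c", "a", "a"])  # made-up values

# True wherever a value differs from the one before it. The first row is
# always True because shift() leaves NaN in the first position.
starts = s != s.shift()

# The running total of those flags gives each contiguous run its own label.
run_id = starts.cumsum()
print(run_id.tolist())  # [1, 2, 2, 3, 4, 4]
```

The second run of "a" gets a new label (4) because it is not adjacent to the first one, which is exactly the behaviour the question asks for.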

Source

2017-01-13 12:01:39 piRSquared

It looks like you want a cumsum():

In [195]: df.groupby(['ID', 'url'])['active_seconds'].cumsum()
Out[195]:
0     5
1     4
2     7
3    12
4     8
5    34
6    12
Name: active_seconds, dtype: int64

Source

2017-01-13 11:31:40 MaxU

Solution 1 - cumsum by the url column only:

You need to groupby a custom Series created by the cumsum of a boolean mask, and then aggregate the url column with first. Then remove the url level with reset_index and finally restore the original column order with reindex:

g = (df.url != df.url.shift()).cumsum()
print(g)
0    1
1    2
2    2
3    3
4    4
5    5
6    6
Name: url, dtype: int32

g = (df.url != df.url.shift()).cumsum()
# alternative with ne
# g = df.url.ne(df.url.shift()).cumsum()

print(df.groupby([df.ID, df.date, g], sort=False).agg({'active_seconds': 'sum', 'url': 'first'})
        .reset_index(level='url', drop=True)
        .reset_index()
        .reindex(columns=df.columns))

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5

Alternatively, group by the url column itself as well and give the helper Series a temporary name, which is dropped afterwards:

g = (df.url != df.url.shift()).cumsum().rename('tmp')
print(g)
0    1
1    2
2    2
3    3
4    4
5    5
6    6
Name: tmp, dtype: int32

print(df.groupby([df.ID, df.url, df.date, g], sort=False)['active_seconds']
        .sum()
        .reset_index(level='tmp', drop=True)
        .reset_index())

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5
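For readers on a newer pandas, the same idea can be written with named aggregation (available since pandas 0.25). This is only a sketch of Solution 1, not the code from the answer; the helper name run and the rebuilt sample df are assumptions made for illustration.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 222, 222, 111],
    "url": ["vk.com", "facebook.com", "facebook.com", "twitter.com",
            "vk.com", "twitter.com", "facebook.com"],
    "date": ["12.01.2016"] * 7,
    "active_seconds": [5, 4, 3, 12, 8, 34, 5],
})

# Label contiguous runs of the same url ("run" is a made-up helper name).
run = df["url"].ne(df["url"].shift()).cumsum().rename("run")

result = (
    df.groupby(["ID", "date", run], sort=False)
      .agg(url=("url", "first"), active_seconds=("active_seconds", "sum"))
      .reset_index()                 # ID, date and run become columns again
      .drop(columns="run")           # the helper label is no longer needed
      .reindex(columns=df.columns)   # restore the original column order
)
print(result)
```

Because the helper has its own name, it never clashes with the url column, so no level has to be dropped by name before the final reset_index.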

Solution 2 - cumsum by the ID and url columns:

g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
print(g)
   ID  url
0   1    1
1   1    2
2   1    2
3   1    3
4   2    4
5   2    5
6   3    6

print(df.groupby([g.ID, df.date, g.url], sort=False)
        .agg({'active_seconds': 'sum', 'url': 'first'})
        .reset_index(level='url', drop=True)
        .reset_index()
        .reindex(columns=df.columns))

   ID           url        date  active_seconds
0   1        vk.com  12.01.2016               5
1   1  facebook.com  12.01.2016               7
2   1   twitter.com  12.01.2016              12
3   2        vk.com  12.01.2016               8
4   2   twitter.com  12.01.2016              34
5   3  facebook.com  12.01.2016               5

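Note that the ID column in this output holds the group numbers from g.ID, not the original IDs. A variant that also groups by the original df.ID and df.url columns keeps them, but it is necessary to rename the columns in the helper df: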

g = df[['ID','url']].ne(df[['ID','url']].shift()).cumsum()
g.columns = g.columns + '1'
print(g)
   ID1  url1
0    1     1
1    1     2
2    1     2
3    1     3
4    2     4
5    2     5
6    3     6

print(df.groupby([df.ID, df.url, df.date, g.ID1, g.url1], sort=False)['active_seconds']
        .sum()
        .reset_index(level=['ID1','url1'], drop=True)
        .reset_index())

    ID           url        date  active_seconds
0  111        vk.com  12.01.2016               5
1  111  facebook.com  12.01.2016               7
2  111   twitter.com  12.01.2016              12
3  222        vk.com  12.01.2016               8
4  222   twitter.com  12.01.2016              34
5  111  facebook.com  12.01.2016               5
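A possible variation on Solution 2, not taken from the answer, is to collapse the two per-column counters into a single run label: a new run starts whenever either ID or url changes. In this sketch the names keys, changed and run are made up, and taking the first date of each run assumes the date is constant within a run, as it is in the sample.

```python
import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 222, 222, 111],
    "url": ["vk.com", "facebook.com", "facebook.com", "twitter.com",
            "vk.com", "twitter.com", "facebook.com"],
    "date": ["12.01.2016"] * 7,
    "active_seconds": [5, 4, 3, 12, 8, 34, 5],
})

keys = ["ID", "url"]

# True on every row where at least one key column differs from the previous row.
changed = df[keys].ne(df[keys].shift()).any(axis=1)
run = changed.cumsum().rename("run")

result = (
    df.groupby(run, sort=False)
      .agg(ID=("ID", "first"),
           url=("url", "first"),
           date=("date", "first"),   # assumes one date per run
           active_seconds=("active_seconds", "sum"))
      .reset_index(drop=True)
)
print(result)
```

With a single label there is nothing to rename in a helper frame, and the original ID values are kept because they are aggregated with first instead of being used as group numbers.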

Timings:

The solutions are similar, but pivot_table is slower than groupby:

In [180]: %timeit (df.assign(g=df.ID.ne(df.ID.shift()).cumsum()).pivot_table('active_seconds', ['g', 'ID', 'url', 'date'], None, 'sum').reset_index([1, 2, 3]).reset_index(drop=True)) 
100 loops, best of 3: 5.02 ms per loop 

In [181]: %timeit (df.groupby([df.ID, df.url, df.date, (df.url != df.url.shift()).cumsum().rename('tmp')], sort=False)['active_seconds'].sum().reset_index(level='tmp', drop=True).reset_index()) 
100 loops, best of 3: 3.62 ms per loop
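Outside IPython, a comparison like this can be repeated with the standard timeit module. The sketch below assumes the sample data from the question; the function names and the repeat count are made up, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

# Sample data copied from the question.
df = pd.DataFrame({
    "ID": [111, 111, 111, 111, 222, 222, 111],
    "url": ["vk.com", "facebook.com", "facebook.com", "twitter.com",
            "vk.com", "twitter.com", "facebook.com"],
    "date": ["12.01.2016"] * 7,
    "active_seconds": [5, 4, 3, 12, 8, 34, 5],
})


def with_pivot_table():
    # Runs of equal ID, then pivot_table to aggregate (sorted output).
    g = df["ID"].ne(df["ID"].shift()).cumsum()
    return (
        df.assign(g=g)
          .pivot_table(values="active_seconds",
                       index=["g", "ID", "url", "date"], aggfunc="sum")
          .reset_index(level=["ID", "url", "date"])
          .reset_index(drop=True)
    )


def with_groupby():
    # Runs of equal url, then a plain groupby sum (original row order).
    tmp = df["url"].ne(df["url"].shift()).cumsum().rename("tmp")
    return (
        df.groupby([df["ID"], df["url"], df["date"], tmp], sort=False)["active_seconds"]
          .sum()
          .reset_index(level="tmp", drop=True)
          .reset_index()
    )


n = 50  # arbitrary repeat count
t_pivot = timeit.timeit(with_pivot_table, number=n)
t_groupby = timeit.timeit(with_groupby, number=n)
print(f"pivot_table: {t_pivot / n * 1e3:.2f} ms per call")
print(f"groupby:     {t_groupby / n * 1e3:.2f} ms per call")
```

On a seven-row frame both timings are dominated by fixed overhead, so a larger input would be needed before drawing conclusions about scaling.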

Source

2017-01-13 11:32:12 jezrael
