Königsweg Logo

2017 Alexander C.S. Hendorf, Königsweg GmbH, Mannheim

In [1]:
%run '2 helpers.py'
%config InlineBackend.figure_format = 'retina';

Effective Data Analysis with Pandas Indexes


Pandas: Data Selection & Indexes


In [2]:
import numpy as np
import pandas as pd
import random
import seaborn as sns
%matplotlib inline

Series

Series = Column

Series do have a data tpye.

In [3]:
series = pd.Series([3, 62, 75, 83, 47, 43, 39, 16, 19, 2])
In [4]:
series
Out[4]:
0     3
1    62
2    75
3    83
4    47
5    43
6    39
7    16
8    19
9     2
dtype: int64
In [5]:
pd.Series([3, 62, 75, 83, 47, 43, 39, 16, 19, 2, 2.66676])
Out[5]:
0      3.00000
1     62.00000
2     75.00000
3     83.00000
4     47.00000
5     43.00000
6     39.00000
7     16.00000
8     19.00000
9      2.00000
10     2.66676
dtype: float64
In [6]:
pd.Series([3, 62, 75, 83, 47, 43, 39, 16, 19, 2, "2"])
Out[6]:
0      3
1     62
2     75
3     83
4     47
5     43
6     39
7     16
8     19
9      2
10     2
dtype: object
In [7]:
# Falle: int -> Float bedingt durch Implementierung in NumPy.
pd.Series([3, None, 6])
Out[7]:
0    3.0
1    NaN
2    6.0
dtype: float64

Selection via Position / Slice

In [8]:
series[0]
Out[8]:
3
In [9]:
series[3:6]
Out[9]:
3    83
4    47
5    43
dtype: int64

iloc uses square backets [ ]

In [10]:
# series[3:6]
series.iloc[3:6]
Out[10]:
3    83
4    47
5    43
dtype: int64

Selection via Label

In [11]:
# set alpha label as new index for the series
series.index = [x for x in "ABCDEFGHIJKLMNOPQRSTUVWXYZ"][:len(series)]
In [12]:
series
Out[12]:
A     3
B    62
C    75
D    83
E    47
F    43
G    39
H    16
I    19
J     2
dtype: int64
In [13]:
series['C']
Out[13]:
75
In [14]:
series.loc['C']
Out[14]:
75
In [15]:
series['D':'F']
# by label: slice includes end! 
Out[15]:
D    83
E    47
F    43
dtype: int64
In [16]:
series[2]
# position, pythonic
Out[16]:
75

What about not unique indexes?

In [17]:
# set alpha label as new index for the series
series.index = [x for x in "GATTACAXYZ"][:len(series)]
In [18]:
series
Out[18]:
G     3
A    62
T    75
T    83
A    47
C    43
A    39
X    16
Y    19
Z     2
dtype: int64
In [19]:
series.loc['G']
Out[19]:
3
In [20]:
try:
    series.loc['G':'A']
except Exception as e:
    print("ERROR:", e)
    # non-unique values breaks slicing
ERROR: "Cannot get right slice bound for non-unique label: 'A'"
In [21]:
series.loc['X':'Z']
# while unique values are still slicable in a non-unique index
Out[21]:
X    16
Y    19
Z     2
dtype: int64
In [22]:
series.index.is_monotonic
Out[22]:
False
In [23]:
series.index.is_unique
Out[23]:
False
In [24]:
series.index = range(series.size)
series
Out[24]:
0     3
1    62
2    75
3    83
4    47
5    43
6    39
7    16
8    19
9     2
dtype: int64
In [25]:
series
Out[25]:
0     3
1    62
2    75
3    83
4    47
5    43
6    39
7    16
8    19
9     2
dtype: int64
In [26]:
series.index.is_unique
Out[26]:
True
In [27]:
series.index.is_monotonic
Out[27]:
True
In [28]:
series.index.is_monotonic_increasing
Out[28]:
True


structure.015.jpeg

Achtung: Panels are deprecated use MultiIndex



DataFrames, 2D Data

In [29]:
df = pd.read_json('data/sampledf.json')
In [30]:
df
Out[30]:
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25

DataFrame[n] selektiert die Spalte an der Position n

In [31]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[:, 2]))

df[2]
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25
Out[31]:
0     21
1     89
2     31
3    100
4     83
5     73
6     18
7     22
8     89
9     36
Name: 2, dtype: int64

Das Slice DataFrame[start:stop] selektiert die Zeilen von start bis stop (stop ist exklusive!)

In [32]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[range(2, 4), :]))

# column
df[2:4]
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25
Out[32]:
0 1 2 3 4 5 6 7 8 9
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51

In [33]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[range(2, 4), range(2, 4)]))


# segment
df.iloc[2:4, 2:4]
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25
Out[33]:
2 3
2 31 69
3 100 6
In [34]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[:, range(2, 4)]))

# column slice
df.iloc[:, 2:4]
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25
Out[34]:
2 3
0 21 99
1 89 66
2 31 69
3 100 6
4 83 85
5 73 73
6 18 98
7 22 4
8 89 13
9 36 54

In [35]:
df
Out[35]:
0 1 2 3 4 5 6 7 8 9
0 79 19 21 99 35 59 44 25 75 58
1 25 39 89 66 9 41 6 69 63 3
2 37 64 31 69 61 97 5 11 76 57
3 74 61 100 6 58 80 95 50 15 51
4 79 60 83 85 16 5 16 69 5 20
5 45 26 73 73 100 60 21 19 95 12
6 12 29 18 98 62 68 92 29 74 96
7 36 32 22 4 66 25 63 51 59 14
8 55 53 89 13 84 87 74 3 2 64
9 46 74 36 54 21 12 68 33 80 25
In [36]:
df.index = ["R{:02d}".format(i) for i in range(len(df))]
In [37]:
df.columns = ["C{:02d}".format(i) for i in range(len(df.columns))]
In [38]:
df
Out[38]:
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R00 79 19 21 99 35 59 44 25 75 58
R01 25 39 89 66 9 41 6 69 63 3
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
R06 12 29 18 98 62 68 92 29 74 96
R07 36 32 22 4 66 25 63 51 59 14
R08 55 53 89 13 84 87 74 3 2 64
R09 46 74 36 54 21 12 68 33 80 25
In [39]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[:, 'C05']))

df['C05']
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R00 79 19 21 99 35 59 44 25 75 58
R01 25 39 89 66 9 41 6 69 63 3
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
R06 12 29 18 98 62 68 92 29 74 96
R07 36 32 22 4 66 25 63 51 59 14
R08 55 53 89 13 84 87 74 3 2 64
R09 46 74 36 54 21 12 68 33 80 25
Out[39]:
R00    59
R01    41
R02    97
R03    80
R04     5
R05    60
R06    68
R07    25
R08    87
R09    12
Name: C05, dtype: int64
In [40]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice['R02':'R05', :]))


df['R02':'R05']
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R00 79 19 21 99 35 59 44 25 75 58
R01 25 39 89 66 9 41 6 69 63 3
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
R06 12 29 18 98 62 68 92 29 74 96
R07 36 32 22 4 66 25 63 51 59 14
R08 55 53 89 13 84 87 74 3 2 64
R09 46 74 36 54 21 12 68 33 80 25
Out[40]:
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
In [41]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice[:, 'C02':'C05']))


df.loc[:, 'C02':'C05']
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R00 79 19 21 99 35 59 44 25 75 58
R01 25 39 89 66 9 41 6 69 63 3
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
R06 12 29 18 98 62 68 92 29 74 96
R07 36 32 22 4 66 25 63 51 59 14
R08 55 53 89 13 84 87 74 3 2 64
R09 46 74 36 54 21 12 68 33 80 25
Out[41]:
C02 C03 C04 C05
R00 21 99 35 59
R01 89 66 9 41
R02 31 69 61 97
R03 100 6 58 80
R04 83 85 16 5
R05 73 73 100 60
R06 18 98 62 68
R07 22 4 66 25
R08 89 13 84 87
R09 36 54 21 12
In [42]:
# visualisation of below - for presentation
display_html(df.style.apply(highlight, subset=pd.IndexSlice['R02':'R05', 'C04':'C05']))


# segment
df.loc['R02':'R05', 'C04':'C05']
C00 C01 C02 C03 C04 C05 C06 C07 C08 C09
R00 79 19 21 99 35 59 44 25 75 58
R01 25 39 89 66 9 41 6 69 63 3
R02 37 64 31 69 61 97 5 11 76 57
R03 74 61 100 6 58 80 95 50 15 51
R04 79 60 83 85 16 5 16 69 5 20
R05 45 26 73 73 100 60 21 19 95 12
R06 12 29 18 98 62 68 92 29 74 96
R07 36 32 22 4 66 25 63 51 59 14
R08 55 53 89 13 84 87 74 3 2 64
R09 46 74 36 54 21 12 68 33 80 25
Out[42]:
C04 C05
R02 61 97
R03 58 80
R04 16 5
R05 100 60

Welcome to the Bluth Online Store

Banana

In [43]:
sales_data = pd.read_excel('data/blooth_sales_data_clean.xlsx')
sales_data.head(5)
Out[43]:
name birthday customer orderdate product units unitprice
0 Pasquale 1967-09-02 Electronics Inc 2016-07-17 13:48:03.157 Thriller record 2 13.27
1 India 1968-12-13 Electronics Resource Group 2016-07-06 13:48:03.157 Corolla 26 24458.69
2 Wayne 1992-09-10 East Application Contract Inc 2016-07-22 13:48:03.157 Rubik’s Cube 41 15.79
3 Cori 1986-11-05 Signal Industries 2016-07-23 13:48:03.157 iPhone 16 584.01
4 Chang 1972-04-23 Star Alpha Industries 2016-07-16 13:48:03.157 Harry Potter book 4 25.69
In [44]:
sales_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB

Boolean Index

Ein Boolean Index (oder mask) ist ein Array mit True/False Werten: [1, 0, 1, 1, 0, 0, 1, …]

In [45]:
sales_data['units'] > 40
Out[45]:
0      False
1      False
2       True
3      False
4      False
5      False
6      False
7      False
8      False
9      False
10     False
11     False
12      True
13      True
14      True
15     False
16     False
17     False
18     False
19     False
20     False
21     False
22     False
23      True
24     False
25     False
26     False
27     False
28      True
29     False
       ...  
970    False
971     True
972    False
973    False
974    False
975    False
976     True
977    False
978    False
979     True
980    False
981    False
982    False
983     True
984    False
985    False
986     True
987    False
988    False
989    False
990    False
991    False
992    False
993    False
994    False
995    False
996    False
997    False
998    False
999    False
Name: units, Length: 1000, dtype: bool
In [46]:
sales_data[
    sales_data['units'] > 40
]
Out[46]:
name birthday customer orderdate product units unitprice
2 Wayne 1992-09-10 East Application Contract Inc 2016-07-22 13:48:03.157 Rubik’s Cube 41 15.79
12 Kelli 1982-05-10 Building Construction Group 2016-07-13 13:48:03.157 Corolla 43 24870.63
13 Rosalina 1967-10-24 Software Group 2016-07-09 13:48:03.157 PlayStation 43 242.63
14 Tracy 1984-11-29 Hill Pacific Analysis Co 2016-07-22 13:48:03.157 iPad 41 646.14
23 Chet 1980-12-18 Vision Speed International 2016-07-15 13:48:03.157 iPhone 44 500.83
28 Dana 1984-07-11 Net Speed Interactive Incorporated 2016-06-30 13:48:03.157 Corolla 44 20103.97
34 Rashad 1981-08-12 General Contract Corporation 2016-07-14 13:48:03.157 iPhone 47 871.32
36 Jody 1979-08-10 Star Digital Speed Organization 2016-07-21 13:48:03.157 Thriller record 41 13.64
39 Jackie 1956-06-27 Frontier Graphics Resource LLC 2016-07-10 13:48:03.157 PlayStation 50 283.83
46 Edward 1974-11-16 Analysis Technology LLC 2016-07-01 13:48:03.158 banana 46 10.00
47 Aurora 1977-09-23 Design Resource Future Inc 2016-07-05 13:48:03.158 Star Wars 47 11.41
54 Shellie 1953-08-09 Studio Corporation 2016-07-26 13:48:03.158 Corolla 47 22244.95
60 Zelma 1984-01-14 Medicine Inc 2016-07-10 13:48:03.158 Rubik’s Cube 45 18.34
67 Jeromy 1971-09-02 Data International 2016-07-06 13:48:03.158 Rubik’s Cube 41 17.08
73 Damaris 1984-06-01 North Innovation Vision Inc 2016-07-26 13:48:03.158 Rubik’s Cube 42 17.38
75 Shiela 1976-03-28 North Incorporated 2016-07-11 13:48:03.158 iPad 47 926.10
76 Ellsworth 1979-03-18 Application Systems Group 2016-07-11 13:48:03.158 banana 46 10.00
79 Shawn 1980-07-14 Universal Frontier Adventure Corporation 2016-07-12 13:48:03.158 Rubik’s Cube 44 18.10
80 Autumn 1970-04-02 Frontier Corporation 2016-07-22 13:48:03.158 Harry Potter book 48 34.31
83 Gerri 1996-09-20 Consulting Direct Data LLC 2016-07-27 13:48:03.158 banana 47 10.00
86 Cory 1987-03-24 Virtual Signal Innovation Inc 2016-07-11 13:48:03.158 Lipitor 45 10.50
89 Benton 1980-05-06 Star Data Incorporated 2016-07-03 13:48:03.158 Harry Potter book 46 13.45
92 Catina 1987-05-21 People LLC 2016-07-12 13:48:03.158 iPad 45 737.98
95 Bette 1989-11-14 People Group 2016-07-15 13:48:03.158 PlayStation 44 214.41
96 Chang 1984-02-05 Venture Alpha Corporation 2016-07-04 13:48:03.158 iPhone 46 690.25
99 Retha 1963-06-17 Hill Net Building Agency 2016-07-15 13:48:03.159 Thriller record 44 15.89
105 Edwina 1986-11-04 Pacific Organization 2016-07-10 13:48:03.159 Star Wars 46 11.06
122 Dorian 1985-03-25 Medicine Limited 2016-07-03 13:48:03.159 iPhone 47 549.29
127 Carissa 1970-03-14 Galaxy Co 2016-07-26 13:48:03.159 banana 42 10.00
130 Aurora 1977-09-23 Design Resource Future Inc 2016-07-24 13:48:03.159 Lipitor 46 10.27
... ... ... ... ... ... ... ...
864 Marion 1962-12-04 Virtual Limited 2016-07-14 13:48:03.175 Corolla 41 21389.06
870 Dewitt 1985-03-03 Building Future Provider Inc 2016-07-20 13:48:03.175 iPhone 44 744.17
874 Jannette 1973-09-10 Architecture Technology Solutions Corporation 2016-07-09 13:48:03.175 banana 42 10.00
878 Ilene 1979-06-02 Innovation Federated Industries 2016-07-09 13:48:03.176 banana 46 10.00
886 Larissa 1991-04-13 Power Virtual Organization 2016-07-25 13:48:03.176 Harry Potter book 44 6.46
891 Jesica 1987-08-15 Network Solutions Group 2016-07-04 13:48:03.176 Rubik’s Cube 45 16.87
893 Araceli 1951-10-15 Vision Co 2016-07-06 13:48:03.176 PlayStation 41 248.14
901 Joanne 1998-07-12 Hardware Studio Contract LLC 2016-07-10 13:48:03.176 Corolla 43 23250.64
911 Mayra 1985-03-06 Building Power Advanced Incorporated 2016-07-26 13:48:03.176 PlayStation 46 214.82
912 Jay 1980-06-26 Analysis West Limited 2016-07-04 13:48:03.176 Corolla 46 24129.19
914 Chang 1972-04-23 Star Alpha Industries 2016-07-14 13:48:03.176 Rubik’s Cube 43 17.31
915 Michael 1985-05-26 Advanced People Telecom Incorporated 2016-07-04 13:48:03.176 iPad 50 748.35
917 Dick 1979-03-21 Technology Contract Co 2016-07-25 13:48:03.176 Harry Potter book 41 18.49
920 Howard 1961-01-31 Federated Agency 2016-07-03 13:48:03.176 banana 41 10.00
922 Marquita 1981-11-09 Direct Limited 2016-07-16 13:48:03.176 Harry Potter book 41 30.64
923 Bridgette 1976-10-20 Solutions Industries 2016-07-21 13:48:03.176 Rubik’s Cube 44 17.77
933 Jeromy 1971-09-02 Data International 2016-07-07 13:48:03.177 Lipitor 46 10.57
940 Wendell 1979-08-26 Vision Application Industries Organization 2016-07-27 13:48:03.177 iPad 42 693.32
944 Faustino 1980-02-04 Analysis Limited 2016-07-07 13:48:03.177 PlayStation 46 265.40
945 Miguel 1966-07-30 Software Power International 2016-06-30 13:48:03.177 Thriller record 44 17.08
952 Kate 1985-06-29 Interactive Industries 2016-07-09 13:48:03.177 Rubik’s Cube 47 18.84
955 Garth 1983-12-02 Design Agency 2016-07-14 13:48:03.177 Harry Potter book 48 28.19
959 Otis 1973-02-24 Star Virtual Incorporated 2016-07-23 13:48:03.177 banana 48 10.00
963 Isreal 1979-06-24 Pacific International 2016-07-10 13:48:03.177 Corolla 41 24879.32
968 Olive 1980-09-13 Data Building Corporation 2016-07-05 13:48:03.177 Corolla 41 21952.38
971 Lemuel 1985-09-01 Analysis Graphics Provider Incorporated 2016-07-07 13:48:03.177 Rubik’s Cube 44 15.73
976 Dong 1970-09-06 Solutions Contract Software Corporation 2016-07-01 13:48:03.178 PlayStation 45 214.58
979 Odis 1985-07-05 Technology International 2016-07-05 13:48:03.178 Harry Potter book 43 9.15
983 Darron 1971-10-19 Interactive International 2016-07-01 13:48:03.178 Corolla 50 22776.03
986 Mitzi 1978-06-28 Design Solutions Co 2016-07-13 13:48:03.178 iPad 46 333.48

215 rows × 7 columns

In [47]:
sales_data[
    (sales_data['units'] > 40) & (sales_data['unitprice'] < 10)  # AND
]
Out[47]:
name birthday customer orderdate product units unitprice
200 Darron 1971-10-19 Interactive International 2016-07-25 13:48:03.160 Star Wars 50 9.99
205 Garth 1983-12-02 Design Agency 2016-07-22 13:48:03.161 Star Wars 42 8.35
209 Jannette 1973-09-10 Architecture Technology Solutions Corporation 2016-07-04 13:48:03.161 Star Wars 41 9.71
249 Ilene 1979-06-02 Innovation Federated Industries 2016-07-15 13:48:03.161 Star Wars 42 9.23
317 Jimmie 1965-06-04 Star Limited 2016-07-11 13:48:03.163 Star Wars 46 8.05
383 Kerry 1982-04-20 Medicine Direct Incorporated 2016-07-04 13:48:03.164 Star Wars 42 8.37
415 Garth 1983-12-02 Design Agency 2016-07-19 13:48:03.165 Star Wars 45 8.11
458 Annette 1976-03-24 East People Co 2016-07-24 13:48:03.166 Star Wars 47 8.95
465 Fernando 1979-06-27 Building Research Venture Organization 2016-07-02 13:48:03.166 Harry Potter book 44 8.70
466 Sandra 1968-01-21 Net LLC 2016-07-17 13:48:03.166 Star Wars 50 9.78
507 Rayford 1969-12-30 Network Bell Digital International 2016-07-21 13:48:03.167 Star Wars 43 9.29
538 Jackie 1956-06-27 Frontier Graphics Resource LLC 2016-07-13 13:48:03.167 Harry Potter book 49 6.56
576 Sharla 1971-05-15 Net Limited 2016-07-15 13:48:03.168 Star Wars 42 8.56
577 Jerrod 1979-06-01 Design Solutions Corporation 2016-07-15 13:48:03.169 Star Wars 47 8.62
695 Ellsworth 1979-03-18 Application Systems Group 2016-07-20 13:48:03.172 Star Wars 45 8.75
886 Larissa 1991-04-13 Power Virtual Organization 2016-07-25 13:48:03.176 Harry Potter book 44 6.46
979 Odis 1985-07-05 Technology International 2016-07-05 13:48:03.178 Harry Potter book 43 9.15
In [48]:
sales_data[
    (sales_data['units'] > 40) | (sales_data['unitprice'] < 10)  # OR
]
Out[48]:
name birthday customer orderdate product units unitprice
2 Wayne 1992-09-10 East Application Contract Inc 2016-07-22 13:48:03.157 Rubik’s Cube 41 15.79
10 Odis 1985-07-05 Technology International 2016-07-15 13:48:03.157 Star Wars 27 8.09
12 Kelli 1982-05-10 Building Construction Group 2016-07-13 13:48:03.157 Corolla 43 24870.63
13 Rosalina 1967-10-24 Software Group 2016-07-09 13:48:03.157 PlayStation 43 242.63
14 Tracy 1984-11-29 Hill Pacific Analysis Co 2016-07-22 13:48:03.157 iPad 41 646.14
23 Chet 1980-12-18 Vision Speed International 2016-07-15 13:48:03.157 iPhone 44 500.83
28 Dana 1984-07-11 Net Speed Interactive Incorporated 2016-06-30 13:48:03.157 Corolla 44 20103.97
34 Rashad 1981-08-12 General Contract Corporation 2016-07-14 13:48:03.157 iPhone 47 871.32
36 Jody 1979-08-10 Star Digital Speed Organization 2016-07-21 13:48:03.157 Thriller record 41 13.64
39 Jackie 1956-06-27 Frontier Graphics Resource LLC 2016-07-10 13:48:03.157 PlayStation 50 283.83
46 Edward 1974-11-16 Analysis Technology LLC 2016-07-01 13:48:03.158 banana 46 10.00
47 Aurora 1977-09-23 Design Resource Future Inc 2016-07-05 13:48:03.158 Star Wars 47 11.41
54 Shellie 1953-08-09 Studio Corporation 2016-07-26 13:48:03.158 Corolla 47 22244.95
60 Zelma 1984-01-14 Medicine Inc 2016-07-10 13:48:03.158 Rubik’s Cube 45 18.34
67 Jeromy 1971-09-02 Data International 2016-07-06 13:48:03.158 Rubik’s Cube 41 17.08
71 Alecia 1984-02-11 Internet Hill Studio Incorporated 2016-07-07 13:48:03.158 Star Wars 9 9.77
73 Damaris 1984-06-01 North Innovation Vision Inc 2016-07-26 13:48:03.158 Rubik’s Cube 42 17.38
75 Shiela 1976-03-28 North Incorporated 2016-07-11 13:48:03.158 iPad 47 926.10
76 Ellsworth 1979-03-18 Application Systems Group 2016-07-11 13:48:03.158 banana 46 10.00
79 Shawn 1980-07-14 Universal Frontier Adventure Corporation 2016-07-12 13:48:03.158 Rubik’s Cube 44 18.10
80 Autumn 1970-04-02 Frontier Corporation 2016-07-22 13:48:03.158 Harry Potter book 48 34.31
83 Gerri 1996-09-20 Consulting Direct Data LLC 2016-07-27 13:48:03.158 banana 47 10.00
86 Cory 1987-03-24 Virtual Signal Innovation Inc 2016-07-11 13:48:03.158 Lipitor 45 10.50
89 Benton 1980-05-06 Star Data Incorporated 2016-07-03 13:48:03.158 Harry Potter book 46 13.45
92 Catina 1987-05-21 People LLC 2016-07-12 13:48:03.158 iPad 45 737.98
95 Bette 1989-11-14 People Group 2016-07-15 13:48:03.158 PlayStation 44 214.41
96 Chang 1984-02-05 Venture Alpha Corporation 2016-07-04 13:48:03.158 iPhone 46 690.25
99 Retha 1963-06-17 Hill Net Building Agency 2016-07-15 13:48:03.159 Thriller record 44 15.89
105 Edwina 1986-11-04 Pacific Organization 2016-07-10 13:48:03.159 Star Wars 46 11.06
111 Perla 1982-09-10 Innovation Alpha LLC 2016-07-05 13:48:03.159 Thriller record 37 9.92
... ... ... ... ... ... ... ...
881 Autumn 1970-04-02 Frontier Corporation 2016-07-15 13:48:03.176 Thriller record 3 9.64
886 Larissa 1991-04-13 Power Virtual Organization 2016-07-25 13:48:03.176 Harry Potter book 44 6.46
891 Jesica 1987-08-15 Network Solutions Group 2016-07-04 13:48:03.176 Rubik’s Cube 45 16.87
893 Araceli 1951-10-15 Vision Co 2016-07-06 13:48:03.176 PlayStation 41 248.14
896 Antony 1973-05-04 Solutions Innovation Telecom Organization 2016-07-14 13:48:03.176 Star Wars 24 8.84
901 Joanne 1998-07-12 Hardware Studio Contract LLC 2016-07-10 13:48:03.176 Corolla 43 23250.64
911 Mayra 1985-03-06 Building Power Advanced Incorporated 2016-07-26 13:48:03.176 PlayStation 46 214.82
912 Jay 1980-06-26 Analysis West Limited 2016-07-04 13:48:03.176 Corolla 46 24129.19
914 Chang 1972-04-23 Star Alpha Industries 2016-07-14 13:48:03.176 Rubik’s Cube 43 17.31
915 Michael 1985-05-26 Advanced People Telecom Incorporated 2016-07-04 13:48:03.176 iPad 50 748.35
917 Dick 1979-03-21 Technology Contract Co 2016-07-25 13:48:03.176 Harry Potter book 41 18.49
920 Howard 1961-01-31 Federated Agency 2016-07-03 13:48:03.176 banana 41 10.00
922 Marquita 1981-11-09 Direct Limited 2016-07-16 13:48:03.176 Harry Potter book 41 30.64
923 Bridgette 1976-10-20 Solutions Industries 2016-07-21 13:48:03.176 Rubik’s Cube 44 17.77
933 Jeromy 1971-09-02 Data International 2016-07-07 13:48:03.177 Lipitor 46 10.57
940 Wendell 1979-08-26 Vision Application Industries Organization 2016-07-27 13:48:03.177 iPad 42 693.32
944 Faustino 1980-02-04 Analysis Limited 2016-07-07 13:48:03.177 PlayStation 46 265.40
945 Miguel 1966-07-30 Software Power International 2016-06-30 13:48:03.177 Thriller record 44 17.08
952 Kate 1985-06-29 Interactive Industries 2016-07-09 13:48:03.177 Rubik’s Cube 47 18.84
955 Garth 1983-12-02 Design Agency 2016-07-14 13:48:03.177 Harry Potter book 48 28.19
959 Otis 1973-02-24 Star Virtual Incorporated 2016-07-23 13:48:03.177 banana 48 10.00
960 Colby 1982-01-31 General Group 2016-07-24 13:48:03.177 Thriller record 23 8.28
963 Isreal 1979-06-24 Pacific International 2016-07-10 13:48:03.177 Corolla 41 24879.32
968 Olive 1980-09-13 Data Building Corporation 2016-07-05 13:48:03.177 Corolla 41 21952.38
971 Lemuel 1985-09-01 Analysis Graphics Provider Incorporated 2016-07-07 13:48:03.177 Rubik’s Cube 44 15.73
976 Dong 1970-09-06 Solutions Contract Software Corporation 2016-07-01 13:48:03.178 PlayStation 45 214.58
979 Odis 1985-07-05 Technology International 2016-07-05 13:48:03.178 Harry Potter book 43 9.15
983 Darron 1971-10-19 Interactive International 2016-07-01 13:48:03.178 Corolla 50 22776.03
986 Mitzi 1978-06-28 Design Solutions Co 2016-07-13 13:48:03.178 iPad 46 333.48
997 Annmarie 1982-06-04 Atlantic Corporation 2016-07-13 13:48:03.178 Thriller record 19 9.16

277 rows × 7 columns

Multi -Index

In [49]:
somany = 20
dfm = pd.DataFrame(
    {'city':[x for x in ['Paris', 'London', 'Berlin', 'New York', 'Mannheim']*10][:20],
      'category': [random.randint(1, 3) for x in range(somany)],
      'price': [random.randint(150, 300) for x in range(somany)],
      'rating': [random.randint(0, 10) for x in range(somany)]})
dfm['country'] = dfm['city'].map(
    {'Paris': 'FR', 'London': 'GB', 'Berlin': 'DE', 'New York': 'US', 'Mannheim': 'DE'})
In [50]:
dfm
Out[50]:
category city price rating country
0 3 Paris 212 1 FR
1 3 London 162 10 GB
2 2 Berlin 269 0 DE
3 1 New York 160 6 US
4 1 Mannheim 214 6 DE
5 1 Paris 235 3 FR
6 3 London 251 0 GB
7 3 Berlin 277 6 DE
8 1 New York 263 2 US
9 1 Mannheim 163 4 DE
10 3 Paris 250 5 FR
11 3 London 288 5 GB
12 2 Berlin 206 5 DE
13 3 New York 201 8 US
14 2 Mannheim 168 9 DE
15 2 Paris 280 5 FR
16 1 London 182 3 GB
17 1 Berlin 290 9 DE
18 1 New York 206 5 US
19 3 Mannheim 199 0 DE
In [51]:
dfg = dfm.groupby(['country', 'city', 'category']).mean()
dfg
Out[51]:
price rating
country city category
DE Berlin 1 290.000000 9.000000
2 237.500000 2.500000
3 277.000000 6.000000
Mannheim 1 188.500000 5.000000
2 168.000000 9.000000
3 199.000000 0.000000
FR Paris 1 235.000000 3.000000
2 280.000000 5.000000
3 231.000000 3.000000
GB London 1 182.000000 3.000000
3 233.666667 5.000000
US New York 1 209.666667 4.333333
3 201.000000 8.000000
In [52]:
dfg.index
Out[52]:
MultiIndex(levels=[['DE', 'FR', 'GB', 'US'], ['Berlin', 'London', 'Mannheim', 'New York', 'Paris'], [1, 2, 3]],
           labels=[[0, 0, 0, 0, 0, 0, 1, 1, 1, 2, 2, 3, 3], [0, 0, 0, 2, 2, 2, 4, 4, 4, 1, 1, 3, 3], [0, 1, 2, 0, 1, 2, 0, 1, 2, 0, 2, 0, 2]],
           names=['country', 'city', 'category'])
In [53]:
dfg.index.levels
Out[53]:
FrozenList([['DE', 'FR', 'GB', 'US'], ['Berlin', 'London', 'Mannheim', 'New York', 'Paris'], [1, 2, 3]])
In [54]:
dfg.index.names
Out[54]:
FrozenList(['country', 'city', 'category'])
In [55]:
pd.set_option('display.multi_sparse', False)  # Tipp
dfg
Out[55]:
price rating
country city category
DE Berlin 1 290.000000 9.000000
DE Berlin 2 237.500000 2.500000
DE Berlin 3 277.000000 6.000000
DE Mannheim 1 188.500000 5.000000
DE Mannheim 2 168.000000 9.000000
DE Mannheim 3 199.000000 0.000000
FR Paris 1 235.000000 3.000000
FR Paris 2 280.000000 5.000000
FR Paris 3 231.000000 3.000000
GB London 1 182.000000 3.000000
GB London 3 233.666667 5.000000
US New York 1 209.666667 4.333333
US New York 3 201.000000 8.000000
In [56]:
pd.set_option('display.multi_sparse', True)
In [57]:
dfg.index.get_level_values(2)
Out[57]:
Int64Index([1, 2, 3, 1, 2, 3, 1, 2, 3, 1, 3, 1, 3], dtype='int64', name='category')
In [58]:
dfg.index.get_level_values(1)
Out[58]:
Index(['Berlin', 'Berlin', 'Berlin', 'Mannheim', 'Mannheim', 'Mannheim',
       'Paris', 'Paris', 'Paris', 'London', 'London', 'New York', 'New York'],
      dtype='object', name='city')
In [59]:
dfg.loc['DE']
Out[59]:
price rating
city category
Berlin 1 290.0 9.0
2 237.5 2.5
3 277.0 6.0
Mannheim 1 188.5 5.0
2 168.0 9.0
3 199.0 0.0
In [60]:
dfg.loc[('DE', 'Mannheim')]
Out[60]:
price rating
category
1 188.5 5.0
2 168.0 9.0
3 199.0 0.0
In [61]:
dfg.loc[('DE', 'Mannheim')].max()  # max of column
Out[61]:
price     199.0
rating      9.0
dtype: float64
In [62]:
dfg.loc[('DE', 'Mannheim', 3)]
Out[62]:
price     199.0
rating      0.0
Name: (DE, Mannheim, 3), dtype: float64
In [63]:
type(dfg.loc[('DE', 'Mannheim', 3)])  # know the type you're working with
Out[63]:
pandas.core.series.Series
In [64]:
dfg.unstack()
Out[64]:
price rating
category 1 2 3 1 2 3
country city
DE Berlin 290.000000 237.5 277.000000 9.000000 2.5 6.0
Mannheim 188.500000 168.0 199.000000 5.000000 9.0 0.0
FR Paris 235.000000 280.0 231.000000 3.000000 5.0 3.0
GB London 182.000000 NaN 233.666667 3.000000 NaN 5.0
US New York 209.666667 NaN 201.000000 4.333333 NaN 8.0
In [65]:
dfg.unstack().unstack()
Out[65]:
price ... rating
category 1 2 ... 2 3
city Berlin London Mannheim New York Paris Berlin London Mannheim New York Paris ... Berlin London Mannheim New York Paris Berlin London Mannheim New York Paris
country
DE 290.0 NaN 188.5 NaN NaN 237.5 NaN 168.0 NaN NaN ... 2.5 NaN 9.0 NaN NaN 6.0 NaN 0.0 NaN NaN
FR NaN NaN NaN NaN 235.0 NaN NaN NaN NaN 280.0 ... NaN NaN NaN NaN 5.0 NaN NaN NaN NaN 3.0
GB NaN 182.0 NaN NaN NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN 5.0 NaN NaN NaN
US NaN NaN NaN 209.666667 NaN NaN NaN NaN NaN NaN ... NaN NaN NaN NaN NaN NaN NaN NaN 8.0 NaN

4 rows × 30 columns


DateTime Index

In [66]:
df = pd.read_csv('data/tempm.csv', header=None)
df.columns = ['timestamp', 'celsius']
df.head(5)
Out[66]:
timestamp celsius
0 2014-09-26T03:50:00 14.0
1 2014-08-10T05:00:00 14.0
2 2014-08-21T22:50:00 12.0
3 2014-08-17T13:20:00 16.0
4 2014-08-06T01:20:00 14.0
In [67]:
df[:100].plot()
Out[67]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c637278>
In [68]:
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4357 entries, 0 to 4356
Data columns (total 2 columns):
timestamp    4357 non-null object
celsius      4354 non-null float64
dtypes: float64(1), object(1)
memory usage: 68.2+ KB
In [69]:
df.index
Out[69]:
RangeIndex(start=0, stop=4357, step=1)
In [70]:
df.index = pd.to_datetime(df['timestamp'])

Easy way to work with timeseries

In [71]:
df.head()
Out[71]:
timestamp celsius
timestamp
2014-09-26 03:50:00 2014-09-26T03:50:00 14.0
2014-08-10 05:00:00 2014-08-10T05:00:00 14.0
2014-08-21 22:50:00 2014-08-21T22:50:00 12.0
2014-08-17 13:20:00 2014-08-17T13:20:00 16.0
2014-08-06 01:20:00 2014-08-06T01:20:00 14.0
In [72]:
df.index
Out[72]:
DatetimeIndex(['2014-09-26 03:50:00', '2014-08-10 05:00:00',
               '2014-08-21 22:50:00', '2014-08-17 13:20:00',
               '2014-08-06 01:20:00', '2014-09-27 06:50:00',
               '2014-08-25 21:50:00', '2014-08-14 05:20:00',
               '2014-09-14 05:20:00', '2014-08-03 02:50:00',
               ...
               '2014-08-22 10:00:00', '2014-09-10 16:20:00',
               '2014-08-14 15:50:00', '2014-09-05 04:00:00',
               '2014-09-29 02:50:00', '2014-08-21 01:50:00',
               '2014-09-13 10:00:00', '2014-08-16 23:20:00',
               '2014-09-28 10:20:00', '2014-09-25 18:20:00'],
              dtype='datetime64[ns]', name='timestamp', length=4357, freq=None)
In [73]:
df.index.is_monotonic
Out[73]:
False
In [74]:
df.index.is_unique
Out[74]:
True
In [75]:
df[:100].plot()
Out[75]:
<matplotlib.axes._subplots.AxesSubplot at 0x11c3f9518>
In [76]:
df['celsius'].groupby([df.index.year, df.index.week]).mean().plot.bar();
In [77]:
df['celsius'].groupby([df.index.year, df.index.dayofyear]).mean().plot();
In [78]:
df['celsius'].groupby([df.index.year, df.index.dayofyear]).agg([min, max]).plot();
In [79]:
df['weekday'] = df.index.weekday
df['weekend'] = df['weekday'].isin({5, 6})
df.groupby(['weekend', df.index.hour])['celsius'].mean().unstack(level=0).plot()
Out[79]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cd1b080>
In [80]:
# selecting ranges
df["2014-08-27":"2014-08-28"]['celsius'].plot()
Out[80]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cd4a240>
In [81]:
df[(df.index.hour > 12) & (df.index.hour <=16)]['celsius'].plot()
Out[81]:
<matplotlib.axes._subplots.AxesSubplot at 0x11cd950b8>

Resampling

In [82]:
df['celsius'].resample('D').max().head()
Out[82]:
timestamp
2014-08-01    25.0
2014-08-02    27.0
2014-08-03    25.0
2014-08-04    24.0
2014-08-05    23.0
Freq: D, Name: celsius, dtype: float64
In [83]:
df['celsius'].resample('M').mean().head()
Out[83]:
timestamp
2014-08-31    16.436652
2014-09-30    14.719216
Freq: M, Name: celsius, dtype: float64
In [84]:
df['celsius'].resample('3D').mean().plot();
In [85]:
df['celsius'].resample('D').agg(['min', 'max']).plot();

refrence.jpg

Categorical

In [86]:
sales_data.head()
Out[86]:
name birthday customer orderdate product units unitprice
0 Pasquale 1967-09-02 Electronics Inc 2016-07-17 13:48:03.157 Thriller record 2 13.27
1 India 1968-12-13 Electronics Resource Group 2016-07-06 13:48:03.157 Corolla 26 24458.69
2 Wayne 1992-09-10 East Application Contract Inc 2016-07-22 13:48:03.157 Rubik’s Cube 41 15.79
3 Cori 1986-11-05 Signal Industries 2016-07-23 13:48:03.157 iPhone 16 584.01
4 Chang 1972-04-23 Star Alpha Industries 2016-07-16 13:48:03.157 Harry Potter book 4 25.69
In [87]:
sales_data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 7 columns):
name         1000 non-null object
birthday     1000 non-null datetime64[ns]
customer     1000 non-null object
orderdate    1000 non-null datetime64[ns]
product      1000 non-null object
units        1000 non-null int64
unitprice    1000 non-null float64
dtypes: datetime64[ns](2), float64(1), int64(1), object(3)
memory usage: 54.8+ KB
In [88]:
sales_data.memory_usage()
Out[88]:
Index          80
name         8000
birthday     8000
customer     8000
orderdate    8000
product      8000
units        8000
unitprice    8000
dtype: int64
In [89]:
m1 = sales_data.memory_usage(deep=True)
m1
Out[89]:
Index           80
name         62814
birthday      8000
customer     81070
orderdate     8000
product      70166
units         8000
unitprice     8000
dtype: int64
In [90]:
sales_data = sales_data.astype({'customer': 'category', 'product': 'category', 'name': 'category'})
In [91]:
m2 = sales_data.memory_usage(deep=True)
m2
Out[91]:
Index           80
name         29131
birthday      8000
customer     35421
orderdate     8000
product       2036
units         8000
unitprice     8000
dtype: int64
In [92]:
m2/m1
Out[92]:
Index        1.000000
name         0.463766
birthday     1.000000
customer     0.436919
orderdate    1.000000
product      0.029017
units        1.000000
unitprice    1.000000
dtype: float64
In [93]:
m1.sum()-m2.sum(), m2.sum()/m1.sum()
Out[93]:
(147462, 0.40087758501604842)
In [94]:
sales_data.index = sales_data['product']
In [95]:
sales_data.head()
Out[95]:
name birthday customer orderdate product units unitprice
product
Thriller record Pasquale 1967-09-02 Electronics Inc 2016-07-17 13:48:03.157 Thriller record 2 13.27
Corolla India 1968-12-13 Electronics Resource Group 2016-07-06 13:48:03.157 Corolla 26 24458.69
Rubik’s Cube Wayne 1992-09-10 East Application Contract Inc 2016-07-22 13:48:03.157 Rubik’s Cube 41 15.79
iPhone Cori 1986-11-05 Signal Industries 2016-07-23 13:48:03.157 iPhone 16 584.01
Harry Potter book Chang 1972-04-23 Star Alpha Industries 2016-07-16 13:48:03.157 Harry Potter book 4 25.69
In [96]:
favorites = ['Harry Potter book', 'Rubik’s Cube', 'Corolla', 'Rubik’s Cube', 'iPhone', 'iPad',
             'PlayStation', 'Star Wars', 'Thriller record',
             'banana', 'Lipitor']
In [97]:
pd.Categorical(['a', 'b', 'c', 'a', 'b', 'c'])
Out[97]:
[a, b, c, a, b, c]
Categories (3, object): [a, b, c]
In [98]:
a = pd.Categorical(values=['bad', 'neutral', 'nice', 'good', 'excellent'], 
                   categories=['bad', 'neutral', 'nice', 'good', 'excellent'],
                   ordered=True)
a.max()
Out[98]:
'excellent'
In [99]:
a.min()
Out[99]:
'bad'
In [ ]:
 
In [ ]:
 
In [ ]: