Python 2: Pandas fundas

The Series Data Structure

import pandas as pd

Getting inline help

pd.Series?
-----------------------------------------------------------------------
Init signature:
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).

Operations between Series (+, -, /, *, **) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes.

Parameters
----------
data : array-like, dict, or scalar value
Contains data stored in Series
.. versionchanged :: 0.23.0
If data is a dict, argument order is maintained for Python 3.6 and later.
index : array-like or Index (1d)
Values must be hashable and have the same length as `data`.
Non-unique index values are allowed. Will default to
RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
sequence are used, the index will override the keys found in the
dict.
dtype : numpy.dtype or None
If None, dtype will be inferred
copy : boolean, default False
Copy input data
File: ~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py
Type: type
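
The docstring's note about index alignment is easy to miss, so here is a minimal sketch of it. The two series and their labels are made up for illustration; they are not from the examples in this post.

```python
import pandas as pd

# Two Series of different lengths with partially overlapping labels
a = pd.Series([1, 2, 3], index=['x', 'y', 'z'])
b = pd.Series([10, 20], index=['y', 'w'])

# Addition aligns on labels, not on position. Only 'y' exists on both
# sides, so every other label in the result holds NaN.
total = a + b
print(total)
print(list(total.index))  # the sorted union of both indexes
```

Because NaN is a float, the result is float64 even though both inputs hold integers.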

animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
-----------------------------------------------------------------------
0    Tiger 
1    Bear 
2    Moose 
dtype: object

numbers = [1, 2, 3]
print(pd.Series(numbers))
num = [1,2,3.4,4,4]
print(pd.Series(num, ['a','b','c','d','5']))
-----------------------------------------------------------------------
0    1 
1    2 
2    3 
dtype: int64 
a    1.0 
b    2.0 
c    3.4 
d    4.0 
5    4.0 
dtype: float64

# For string data, a missing value given as None is kept as None
animals = ['Tiger', 'Bear', None] 
pd.Series(animals)
-----------------------------------------------------------------------
0     Tiger 
1     Bear 
2     None 
dtype: object

# For numeric data, None is converted to NaN (Not a Number) and the dtype becomes float64
numbers = [1, 2, None] 
x = pd.Series(numbers)
print(x)
print(x[0] == 1.0)
print(x.loc[0] == x.iloc[0])
-----------------------------------------------------------------------
0    1.0 
1    2.0 
2    NaN 
dtype: float64 
True 
True

import numpy as np
print(np.nan == None) # NaN is not None
-----------------------------------------------------------------------
False

np.nan == np.nan # NaN is not equal to NaN
-----------------------------------------------------------------------
False

print(np.isnan(np.nan)) # == cannot detect NaN; use np.isnan() (or pd.isna()) instead
print(np.isnan(x[2]))
-----------------------------------------------------------------------
True
True
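
np.isnan() only accepts numeric input, so it may help to see the pandas-level check as well. The sketch below uses Series.isna() and pd.isna(), which treat both NaN and None as missing. The variable names are only for illustration.

```python
import numpy as np
import pandas as pd

numbers = pd.Series([1, 2, None])             # None becomes NaN here
animals = pd.Series(['Tiger', 'Bear', None])  # string data with a missing value

# isna() returns a boolean Series marking the missing entries
numeric_missing = numbers.isna().tolist()
string_missing = animals.isna().tolist()
print(numeric_missing)
print(string_missing)

# pd.isna() also works on single values
print(pd.isna(np.nan), pd.isna(None))
```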

sports = {'Archery': 'Bhutan',
           'Golf': 'Scotland',
           'Sumo': 'Japan',
           'Taekwondo': 'South Korea'}
s = pd.Series(sports) # a dictionary instead of a list: keys become the index, values the data
s
-----------------------------------------------------------------------
Archery           Bhutan 
Golf            Scotland 
Sumo               Japan 
Taekwondo    South Korea 
dtype: object

print(s.index) # returns all index labels as an Index object (its module path varies by pandas version)
type(s.index)
-----------------------------------------------------------------------
Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')

pandas.indexes.base.Index

s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada']) 
# The number of values and the number of index labels must match 
print(s)
-----------------------------------------------------------------------
India      Tiger 
America     Bear 
Canada     Moose 
dtype: object

t = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America']) # raises ValueError: 3 values but 2 labels
u = pd.Series(['Tiger', 'Bear'], index=['India', 'America', 'Canada']) # raises ValueError: 2 values but 3 labels
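
To see the failure without stopping the notebook, the mismatch can be wrapped in try/except. This is a small sketch; the exact wording of the error message differs between pandas versions, so only the exception type is relied on.

```python
import pandas as pd

try:
    # three values but only two index labels
    pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America'])
    mismatch_raised = False
except ValueError as err:
    mismatch_raised = True
    print('ValueError:', err)
```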

sports = {'Archery': 'Bhutan',
           'Golf': 'Scotland',
           'Sumo': None,
           'Taekwondo': np.nan} # None and NaN can coexist in the same Series
a = pd.Series(sports) 
# A dictionary on its own is enough to build a Series; 
# its keys become the index
print(a)
# When an index is passed along with a dictionary, the new Series contains 
# only the labels in that index. In this case Archery and Taekwondo are 
# dropped, and Hockey, which has no entry in the dictionary, 
# is added with NaN
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
print(s)
np.isnan(s['Hockey'])
-----------------------------------------------------------------------
Archery        Bhutan 
Golf         Scotland 
Sumo             None 
Taekwondo         NaN 
dtype: object 
Golf      Scotland 
Sumo          None 
Hockey         NaN 
dtype: object 

True
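
Passing an index together with a dictionary behaves much like building the Series first and then calling reindex() with the wanted labels. The sketch below puts the two side by side. It uses a smaller dictionary than the one above, and the variable names are made up for illustration.

```python
import pandas as pd

sports = {'Archery': 'Bhutan', 'Golf': 'Scotland', 'Sumo': 'Japan'}
wanted = ['Golf', 'Sumo', 'Hockey']

# Route 1: pass the index to the constructor
from_dict = pd.Series(sports, index=wanted)

# Route 2: build from the dictionary, then select and order labels with reindex()
reindexed = pd.Series(sports).reindex(wanted)

print(from_dict)
print(reindexed)
```

In both results Archery is gone and Hockey is present with a missing value.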

Querying a Series

sports = {'Archery': 'Bhutan',
           'Golf': 'Scotland',
           'Sumo': 'Japan',
           'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
-----------------------------------------------------------------------
Archery           Bhutan 
Golf            Scotland 
Sumo               Japan 
Taekwondo    South Korea 
dtype: object

s.iloc[3]
-----------------------------------------------------------------------
'South Korea'

s.loc['Taekwondo']
-----------------------------------------------------------------------
'South Korea'

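s[3] # integer key on a string-labelled index falls back to position; newer pandas deprecates this, prefer s.iloc[3]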
-----------------------------------------------------------------------
'South Korea'

s['Golf']
-----------------------------------------------------------------------
'Scotland'

sports = {99: 'Bhutan',
           100: 'Scotland',
           101: 'Japan',
           102: 'South Korea'}
s = pd.Series(sports)
s
-----------------------------------------------------------------------
99          Bhutan 
100       Scotland 
101          Japan 
102    South Korea 
dtype: object

print(s.iloc[0])
print(s.loc[99])
print(s[99])
print(s.iloc[0] == s.loc[99])
print(s.iloc[0] == s[99])
# s[0] # raises KeyError: with an integer index, s[0] looks up the label 0, not position 0
# print(s.iloc[99]) # raises IndexError: position 99 is out of bounds
-----------------------------------------------------------------------
Bhutan 
Bhutan 
Bhutan 
True 
True
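
The two commented-out lines above can be tried safely by catching the exceptions. A small sketch follows; the errors list is only there to record what happened.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

errors = []
try:
    s[0]          # 0 is treated as a label, and there is no label 0
except KeyError:
    errors.append('KeyError')

try:
    s.iloc[99]    # there are only 4 positions (0 to 3)
except IndexError:
    errors.append('IndexError')

print(errors)
```

When the index holds integers, spelling out .loc or .iloc avoids having to remember which rule plain [] follows.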

s = pd.Series([100.00, 120.00, 101.00, 3.00])
s
-----------------------------------------------------------------------
0    100.0 
1    120.0 
2    101.0 
3      3.0 
dtype: float64

total = 0
for item in s:
    total+=item
print(total)
-----------------------------------------------------------------------
324.0

import numpy as np

total = np.sum(s)
print(total)
-----------------------------------------------------------------------
324.0

# inline help for randint, which is used below to create a big series of random numbers
np.random.randint?
-----------------------------------------------------------------------
randint(low, high=None, size=None)

Return random integers from low (inclusive) to high (exclusive).

Return random integers from the "discrete uniform" distribution in the
"half-open" interval [low, high). If high is None (the default),
then results are from [0, low).

Parameters
----------
low : int
Lowest (signed) integer to be drawn from the distribution (unless
high=None, in which case this parameter is the highest such
integer).
high : int, optional
If provided, one above the largest (signed) integer to be drawn
from the distribution (see above for behavior if high=None).
size : int or tuple of ints, optional
Output shape.  If the given shape is, e.g., (m, n, k), then
m * n * k samples are drawn.  Default is None, in which case a
single value is returned.

Returns
-------
out : int or ndarray of ints
size-shaped array of random integers from the appropriate
distribution, or a single such random int if size not provided.

s = pd.Series(np.random.randint(0,1000,10000))
print(s.head())
print(s.tail())
len(s)
-----------------------------------------------------------------------
0    615 
1    655 
2    505 
3    505 
4    566 
dtype: int64 
9995    662 
9996    710 
9997    694 
9998     87 
9999    701 
dtype: int64

10000

%%timeit -n 100
summary = 0
for item in s:
    summary+=item
-----------------------------------------------------------------------
881 µs ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 100
summary = np.sum(s)
-----------------------------------------------------------------------
100 loops, best of 3: 187 µs per loop

summary = np.sum(s)
print("Total: {} Average: {}".format(summary, summary/len(s)) )
s+=2 #adds two to each item in s using broadcasting
s.head()
-----------------------------------------------------------------------
Total: 5024510 Average: 502.451

0    617 
1    657 
2    507 
3    507 
4    568 
dtype: int64

%%timeit -n 100
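for label, value in s.iteritems():
    s.set_value(label, value+2)
# Another way to update every item: iterate and set each value. 
# Updating this way is slow. (iteritems() and set_value() were removed in 
# later pandas releases; items() and s.at[label] are the replacements.)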
-----------------------------------------------------------------------
82.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)

%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.iteritems():
    s.loc[label]= value+2
# updating through .loc inside a loop is even slower; it is the worst option here
-----------------------------------------------------------------------
10 loops, best of 3: 1.58 s per loop

%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
# broadcasting is the fastest way to update; it is the best option
-----------------------------------------------------------------------
10 loops, best of 3: 427 µs per loop
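
%%timeit only works inside IPython or Jupyter. For a plain Python script, a comparable measurement can be sketched with the standard timeit module. The function names and the repeat count of 20 are arbitrary choices for illustration, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum():
    # add the items one at a time in Python
    total = 0
    for item in s:
        total += item
    return total

def vectorized_sum():
    # let pandas/NumPy do the summing in compiled code
    return s.sum()

# Both approaches must agree on the result before timing them
assert loop_sum() == vectorized_sum()

loop_time = timeit.timeit(loop_sum, number=20)
vector_time = timeit.timeit(vectorized_sum, number=20)
print("loop: {:.4f}s  vectorized: {:.4f}s".format(loop_time, vector_time))
```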

s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears' #adds new item to the series
s
-----------------------------------------------------------------------
0             1 
1             2 
2             3 
Animal    Bears 
dtype: object
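
Note the dtype in the output above: adding a string to a series of integers changes it to object. A minimal sketch that records the dtype before and after (the before and after names are only for illustration):

```python
import pandas as pd

s = pd.Series([1, 2, 3])
before = s.dtype            # an integer dtype

s.loc['Animal'] = 'Bears'   # new label, so the series grows by one item
after = s.dtype             # object, since integers and a string are now mixed

print(before, after)
print(s.index.tolist())     # integer and string labels side by side
```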

original_sports = pd.Series({'Archery': 'Bhutan',
                              'Golf': 'Scotland',
                              'Sumo': 'Japan',
                              'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
                                       'Barbados',
                                       'Pakistan',
                                       'England'], 
                                    index=['Cricket',
                                           'Cricket',
                                           'Cricket',
                                           'Cricket'])
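# correct way of appending a series; append() returns a new Series
# (Series.append was removed in pandas 2.0, where pd.concat replaces it)
all_countries = original_sports.append(cricket_loving_countries) 
print(all_countries)

# + does not append: it aligns the two series on their index labels, 
# and labels without a match on both sides give NaN
new_all_countries = original_sports + cricket_loving_countries 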
print(new_all_countries)
-----------------------------------------------------------------------

Archery           Bhutan 
Golf            Scotland 
Sumo               Japan 
Taekwondo    South Korea 
Cricket        Australia 
Cricket         Barbados 
Cricket         Pakistan 
Cricket          England 
dtype: object 

Archery      NaN 
Cricket      NaN 
Cricket      NaN 
Cricket      NaN 
Cricket      NaN 
Golf         NaN 
Sumo         NaN 
Taekwondo    NaN 
dtype: object
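
On newer pandas releases, where Series.append is no longer available, the same result can be sketched with pd.concat. Like append, it returns a new Series and leaves the inputs untouched.

```python
import pandas as pd

original_sports = pd.Series({'Archery': 'Bhutan',
                             'Golf': 'Scotland',
                             'Sumo': 'Japan',
                             'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia', 'Barbados',
                                      'Pakistan', 'England'],
                                     index=['Cricket'] * 4)

# concat takes a list of Series and stacks them one after another
all_countries = pd.concat([original_sports, cricket_loving_countries])
print(all_countries)
print(len(original_sports))  # the original is unchanged
```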

original_sports
-----------------------------------------------------------------------
Archery           Bhutan 
Golf            Scotland 
Sumo               Japan 
Taekwondo    South Korea 
dtype: object

cricket_loving_countries
-----------------------------------------------------------------------
Cricket    Australia 
Cricket     Barbados 
Cricket     Pakistan 
Cricket      England 
dtype: object

all_countries
-----------------------------------------------------------------------

Archery           Bhutan 
Golf            Scotland 
Sumo               Japan 
Taekwondo    South Korea 
Cricket        Australia 
Cricket         Barbados 
Cricket         Pakistan 
Cricket          England 
dtype: object

all_countries.loc['Cricket'] 
# .loc can return multiple rows when a label is repeated, like a hash table that allows duplicate keys
-----------------------------------------------------------------------
Cricket    Australia 
Cricket     Barbados 
Cricket     Pakistan 
Cricket      England 
dtype: object
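
Since .loc returns a single value for a unique label and a Series for a repeated one, it can be worth checking the index first. A small sketch, using a shortened version of the series above:

```python
import pandas as pd

s = pd.Series(['Bhutan', 'Australia', 'England'],
              index=['Archery', 'Cricket', 'Cricket'])

print(s.index.is_unique)      # False, because 'Cricket' appears twice

single = s.loc['Archery']     # unique label: a plain value
multiple = s.loc['Cricket']   # repeated label: a Series with both rows

print(type(single).__name__, type(multiple).__name__)
```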

The DataFrame Data Structure

import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
                         'Item Purchased': 'Dog Food',
                         'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
                         'Item Purchased': 'Kitty Litter',
                         'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
                         'Item Purchased': 'Bird Seed',
                         'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
print(df.head())
new_df = pd.DataFrame([purchase_1, purchase_2, purchase_3])
print(new_df.head())
-----------------------------------------------------------------------
         Cost Item Purchased   Name 
Store 1  22.5       Dog Food  Chris 
Store 1   2.5   Kitty Litter  Kevyn 
Store 2   5.0      Bird Seed  Vinod 
   Cost Item Purchased   Name 
0  22.5       Dog Food  Chris 
1   2.5   Kitty Litter  Kevyn 
2   5.0      Bird Seed  Vinod

print(df.loc['Store 2'])
print('----')
print(df.iloc[2])
print('----')
print(df.loc['Store 1'] == df.iloc[1])
-----------------------------------------------------------------------

Cost                      5 
Item Purchased    Bird Seed 
Name                  Vinod 
Name: Store 2, dtype: object 
---- 
Cost                      5 
Item Purchased    Bird Seed 
Name                  Vinod 
Name: Store 2, dtype: object 
----
           Cost Item Purchased   Name 
Store 1  False          False  False 
Store 1   True           True   True
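
The last comparison above makes more sense once the return types are clear: df.loc['Store 2'] is a single row (a Series), while df.loc['Store 1'] matches two rows and comes back as a DataFrame. The sketch below rebuilds the same data from plain dictionaries, which is an assumption made to keep it short, and also picks out one column.

```python
import pandas as pd

df = pd.DataFrame([{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},
                   {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
                   {'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00}],
                  index=['Store 1', 'Store 1', 'Store 2'])

one = df.loc['Store 2']            # unique label: one row, returned as a Series
many = df.loc['Store 1']           # repeated label: two rows, returned as a DataFrame
costs = df.loc['Store 1', 'Cost']  # row label plus column name

print(type(one).__name__, type(many).__name__)
print(costs.tolist())
```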

-----------------------------------------------------------------------