The Series Data Structure
import pandas as pd
Getting inline help
pd.Series?
-----------------------------------------------------------------------
Init signature:
pd.Series(data=None, index=None, dtype=None, name=None, copy=False, fastpath=False)
Docstring:
One-dimensional ndarray with axis labels (including time series).
Labels need not be unique but must be a hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN).
Operations between Series (+, -, /, *, **) align values based on their associated index values-- they need not be the same length. The result index will be the sorted union of the two indexes.
Parameters
----------
data : array-like, dict, or scalar value
Contains data stored in Series
.. versionchanged :: 0.23.0
If data is a dict, argument order is maintained for Python 3.6 and later.
index : array-like or Index (1d)
Values must be hashable and have the same length as `data`.
Non-unique index values are allowed. Will default to
RangeIndex (0, 1, 2, ..., n) if not provided. If both a dict and index
sequence are used, the index will override the keys found in the
dict.
dtype : numpy.dtype or None
If None, dtype will be inferred
copy : boolean, default False
Copy input data
File: ~/anaconda3/lib/python3.6/site-packages/pandas/core/series.py Type: type
animals = ['Tiger', 'Bear', 'Moose']
pd.Series(animals)
-----------------------------------------------------------------------
0 Tiger
1 Bear
2 Moose
dtype: object
numbers = [1, 2, 3]
print(pd.Series(numbers))
num = [1,2,3.4,4,4]
print(pd.Series(num, ['a','b','c','d','5']))
-----------------------------------------------------------------------
0 1
1 2
2 3
dtype: int64
a 1.0
b 2.0
c 3.4
d 4.0
5 4.0
dtype: float64
# For string data, a missing value given as None is stored as None
animals = ['Tiger', 'Bear', None]
pd.Series(animals)
-----------------------------------------------------------------------
0 Tiger
1 Bear
2 None
dtype: object
# For numeric data, None is converted to NaN (Not a Number) and the dtype becomes float64
numbers = [1, 2, None]
x = pd.Series(numbers)
print(x)
print(x[0] == 1.0)
print(x.loc[0] == x.iloc[0])
-----------------------------------------------------------------------
0 1.0
1 2.0
2 NaN
dtype: float64
True
True
import numpy as np
print(np.nan == None) # NaN is not None
-----------------------------------------------------------------------
False
np.nan == np.nan # NaN is not equal to NaN
-----------------------------------------------------------------------
False
print(np.isnan(np.nan)) # == never matches NaN, so test for it with a function such as isnan()
print(np.isnan(x[2]))
-----------------------------------------------------------------------
True
True
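The checks above use np.isnan, which only accepts numeric input. As a rough sketch, pandas also offers isna(), which handles both NaN and None; the variable names below are only illustrative.

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, None])                    # None is stored as NaN
animals = pd.Series(['Tiger', 'Bear', None])   # string data with a missing entry

numeric_mask = x.isna()        # element-wise test for missing values
text_mask = animals.isna()     # also works on string data, where np.isnan cannot be used

print(numeric_mask.tolist())
print(text_mask.tolist())
print(pd.isna(None), pd.isna(np.nan))   # the top-level function accepts scalars too
```

The mask can then be passed to boolean indexing, for example x[~numeric_mask], to keep only the values that are present.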
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'}
s = pd.Series(sports) # the data is a dictionary here instead of a list; its keys become the index
s
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
print(s.index) # returns all the index labels as a pandas Index object
type(s.index)
-----------------------------------------------------------------------
Index(['Archery', 'Golf', 'Sumo', 'Taekwondo'], dtype='object')
pandas.indexes.base.Index
s = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America', 'Canada'])
# The number of values and the number of index labels must match
print(s)
-----------------------------------------------------------------------
India Tiger
America Bear
Canada Moose
dtype: object
# t = pd.Series(['Tiger', 'Bear', 'Moose'], index=['India', 'America']) # raises ValueError: 3 values but 2 labels
# u = pd.Series(['Tiger', 'Bear'], index=['India', 'America', 'Canada']) # raises ValueError: 2 values but 3 labels
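To see the length mismatch without stopping the notebook, the constructor call can be wrapped in try/except. This is a small sketch; try_series is a hypothetical helper, not part of pandas.

```python
import pandas as pd

def try_series(data, index):
    """Return the Series, or a short error string if construction fails."""
    try:
        return pd.Series(data, index=index)
    except ValueError as exc:
        return "ValueError: {}".format(exc)

too_few_labels = try_series(['Tiger', 'Bear', 'Moose'], ['India', 'America'])
too_many_labels = try_series(['Tiger', 'Bear'], ['India', 'America', 'Canada'])
matching = try_series(['Tiger', 'Bear'], ['India', 'America'])

print(too_few_labels)
print(too_many_labels)
print(matching)
```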
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': None,
'Taekwondo': np.nan} # None and NaN can co-exist together
a = pd.Series(sports)
# A dictionary alone is enough to build a Series: its keys become
# the index labels
print(a)
# When an explicit index is passed along with a dictionary, the new Series
# contains only the labels in that index. Here Archery and Taekwondo are
# dropped, and Hockey, which has no entry in the dictionary, is added
# with NaN
s = pd.Series(sports, index=['Golf', 'Sumo', 'Hockey'])
print(s)
np.isnan(s['Hockey'])
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo None
Taekwondo NaN
dtype: object
Golf Scotland
Sumo None
Hockey NaN
dtype: object
True
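One way to read the dictionary-plus-index behaviour is as a reindex of the plain dictionary Series. The sketch below compares the two under that assumption; the name wanted is only illustrative.

```python
import numpy as np
import pandas as pd

sports = {'Archery': 'Bhutan',
          'Golf': 'Scotland',
          'Sumo': None,
          'Taekwondo': np.nan}
wanted = ['Golf', 'Sumo', 'Hockey']

s = pd.Series(sports, index=wanted)      # dictionary and explicit index together
r = pd.Series(sports).reindex(wanted)    # build from the dictionary, then select labels

print(s.index.tolist())
print(s.isna().tolist())                 # Sumo (None) and Hockey (no entry) both count as missing
print(s.index.equals(r.index))
```

Note that isna() is used here rather than np.isnan, because np.isnan would fail on the None stored under Sumo.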
Querying a Series
sports = {'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'}
s = pd.Series(sports)
s
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
s.iloc[3]
-----------------------------------------------------------------------
'South Korea'
s.loc['Taekwondo']
-----------------------------------------------------------------------
'South Korea'
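s[3] # the labels are strings, so the integer falls back to a position; newer pandas releases deprecate this fallback, so prefer s.iloc[3]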
-----------------------------------------------------------------------
'South Korea'
s['Golf']
-----------------------------------------------------------------------
'Scotland'
sports = {99: 'Bhutan',
100: 'Scotland',
101: 'Japan',
102: 'South Korea'}
s = pd.Series(sports)
s
-----------------------------------------------------------------------
99 Bhutan
100 Scotland
101 Japan
102 South Korea
dtype: object
print(s.iloc[0])
print(s.loc[99])
print(s[99])
print(s.iloc[0] == s.loc[99])
print(s.iloc[0] == s[99])
# s[0] # does not call s.iloc[0] as one might expect: 0 is treated as a label, no such label exists, so this raises KeyError
# print(s.iloc[99]) # raises IndexError: there is no position 99
-----------------------------------------------------------------------
Bhutan
Bhutan
Bhutan
True
True
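The two commented-out lookups fail in different ways, which the following sketch makes visible. lookup_error is a hypothetical helper written for this illustration.

```python
import pandas as pd

s = pd.Series({99: 'Bhutan', 100: 'Scotland', 101: 'Japan', 102: 'South Korea'})

def lookup_error(action):
    """Run a lookup and return the name of the exception it raises, or None."""
    try:
        action()
    except (KeyError, IndexError) as exc:
        return type(exc).__name__
    return None

label_miss = lookup_error(lambda: s[0])           # 0 is looked up as a label
position_miss = lookup_error(lambda: s.iloc[99])  # 99 is looked up as a position
label_hit = lookup_error(lambda: s.loc[99])       # 99 exists as a label

print(label_miss, position_miss, label_hit)
```

Using .loc and .iloc explicitly avoids having to remember which rule the plain [] operator applies for a given index.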
s = pd.Series([100.00, 120.00, 101.00, 3.00])
s
-----------------------------------------------------------------------
0 100.0
1 120.0
2 101.0
3 3.0
dtype: float64
total = 0
for item in s:
    total += item
print(total)
-----------------------------------------------------------------------
324.0
import numpy as np
total = np.sum(s)
print(total)
-----------------------------------------------------------------------
324.0
# np.random.randint is used below to create a big Series of random numbers
np.random.randint?
-----------------------------------------------------------------------
randint(low, high=None, size=None)
Return random integers from low (inclusive) to high (exclusive).
Return random integers from the "discrete uniform" distribution in the
"half-open" interval [low, high). If high is None (the default),
then results are from [0, low).
Parameters
----------
low : int
Lowest (signed) integer to be drawn from the distribution (unless
high=None, in which case this parameter is the highest such
integer).
high : int, optional
If provided, one above the largest (signed) integer to be drawn
from the distribution (see above for behavior if high=None).
size : int or tuple of ints, optional
Output shape. If the given shape is, e.g., (m, n, k), then
m * n * k samples are drawn. Default is None, in which case a
single value is returned.
Returns
-------
out : int or ndarray of ints
size-shaped array of random integers from the appropriate
distribution, or a single such random int if size not provided.
s = pd.Series(np.random.randint(0,1000,10000))
print(s.head())
print(s.tail())
len(s)
-----------------------------------------------------------------------
0 615
1 655
2 505
3 505
4 566
dtype: int64
9995 662
9996 710
9997 694
9998 87
9999 701
dtype: int64
10000
%%timeit -n 100
summary = 0
for item in s:
    summary += item
-----------------------------------------------------------------------
881 µs ± 50.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 100
summary = np.sum(s)
-----------------------------------------------------------------------
100 loops, best of 3: 187 µs per loop
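The %%timeit magic only works inside IPython or Jupyter. As a rough equivalent for a plain Python script, the standard timeit module can run the same comparison; loop_sum and vector_sum are hypothetical names, and the timings will differ from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

s = pd.Series(np.random.randint(0, 1000, 10000))

def loop_sum(series):
    """Add the items one at a time in a Python loop."""
    total = 0
    for item in series:
        total += item
    return total

def vector_sum(series):
    """Let NumPy add the whole Series in a single call."""
    return np.sum(series)

# Time 100 runs of each approach, mirroring %%timeit -n 100
loop_seconds = timeit.timeit(lambda: loop_sum(s), number=100)
vector_seconds = timeit.timeit(lambda: vector_sum(s), number=100)

print("loop:   {:.6f} s for 100 runs".format(loop_seconds))
print("vector: {:.6f} s for 100 runs".format(vector_seconds))
print("same result:", loop_sum(s) == vector_sum(s))
```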
summary = np.sum(s)
print("Total: {} Average: {}".format(summary, summary/len(s)) )
s+=2 #adds two to each item in s using broadcasting
s.head()
-----------------------------------------------------------------------
Total: 5024510 Average: 502.451
0 617
1 657
2 507
3 507
4 568
dtype: int64
%%timeit -n 100
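for label, value in s.items():  # Series.iteritems() in pandas before 2.0
    s.at[label] = value+2  # Series.set_value(label, value+2) in pandas before 1.0
# Another way of iterating over all the data and updating it;
# updating item by item like this is slow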
-----------------------------------------------------------------------
82.5 ms ± 2.49 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
for label, value in s.items():  # Series.iteritems() in pandas before 2.0
    s.loc[label] = value+2
# Updating through .loc one item at a time is slower still; it is the worst option
-----------------------------------------------------------------------
10 loops, best of 3: 1.58 s per loop
%%timeit -n 10
s = pd.Series(np.random.randint(0,1000,10000))
s+=2
# Broadcasting is the fastest way to update; it is the best option
-----------------------------------------------------------------------
10 loops, best of 3: 427 µs per loop
s = pd.Series([1, 2, 3])
s.loc['Animal'] = 'Bears' #adds new item to the series
s
-----------------------------------------------------------------------
0 1
1 2
2 3
Animal Bears
dtype: object
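Adding a string under a new label changes more than the length of the Series. A minimal sketch of what happens to the dtypes, assuming the enlargement through .loc behaves as in the output above:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
values_before = s.dtype          # an integer dtype
index_before = s.index.dtype     # integer labels 0, 1, 2

s.loc['Animal'] = 'Bears'        # the label does not exist yet, so the Series grows

values_after = s.dtype           # integers and a string no longer fit one numeric dtype
index_after = s.index.dtype      # the labels are now a mix of integers and a string

print(values_before, '->', values_after)
print(index_before, '->', index_after)
print(len(s))
```

Once the values are stored as generic objects, the fast numeric operations shown earlier no longer apply, so mixing types in one Series is best avoided.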
original_sports = pd.Series({'Archery': 'Bhutan',
'Golf': 'Scotland',
'Sumo': 'Japan',
'Taekwondo': 'South Korea'})
cricket_loving_countries = pd.Series(['Australia',
'Barbados',
'Pakistan',
'England'],
index=['Cricket',
'Cricket',
'Cricket',
'Cricket'])
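# Series.append was removed in pandas 2.0; pd.concat is the supported way
# to join two Series end to end
all_countries = pd.concat([original_sports, cricket_loving_countries])
print(all_countries)
# + does not give the desired output: it aligns the two Series on their
# index labels, and since no label appears in both, every entry is NaN
new_all_countries = original_sports + cricket_loving_countries
print(new_all_countries)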
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
Archery NaN
Cricket NaN
Cricket NaN
Cricket NaN
Cricket NaN
Golf NaN
Sumo NaN
Taekwondo NaN
dtype: object
original_sports
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
dtype: object
cricket_loving_countries
-----------------------------------------------------------------------
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
all_countries
-----------------------------------------------------------------------
Archery Bhutan
Golf Scotland
Sumo Japan
Taekwondo South Korea
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
all_countries.loc['Cricket']
# .loc returns every row that carries the label, much like a hash table that allows duplicate keys
-----------------------------------------------------------------------
Cricket Australia
Cricket Barbados
Cricket Pakistan
Cricket England
dtype: object
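Because labels may repeat, the type that .loc returns depends on how many rows match. A short sketch with a smaller, made-up Series:

```python
import pandas as pd

s = pd.concat([pd.Series({'Golf': 'Scotland'}),
               pd.Series(['Australia', 'England'], index=['Cricket', 'Cricket'])])

one = s.loc['Golf']        # the label occurs once: a single value comes back
many = s.loc['Cricket']    # the label occurs twice: a Series comes back

print(type(one).__name__, type(many).__name__)
print(s.index.is_unique)   # False as soon as any label repeats
```

Checking index.is_unique first is one way to know in advance whether a lookup can return more than one row.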
The DataFrame Data Structure
import pandas as pd
purchase_1 = pd.Series({'Name': 'Chris',
'Item Purchased': 'Dog Food',
'Cost': 22.50})
purchase_2 = pd.Series({'Name': 'Kevyn',
'Item Purchased': 'Kitty Litter',
'Cost': 2.50})
purchase_3 = pd.Series({'Name': 'Vinod',
'Item Purchased': 'Bird Seed',
'Cost': 5.00})
df = pd.DataFrame([purchase_1, purchase_2, purchase_3], index=['Store 1', 'Store 1', 'Store 2'])
print(df.head())
new_df = pd.DataFrame([purchase_1, purchase_2, purchase_3])
print(new_df.head())
-----------------------------------------------------------------------
Cost Item Purchased Name
Store 1 22.5 Dog Food Chris
Store 1 2.5 Kitty Litter Kevyn
Store 2 5.0 Bird Seed Vinod
Cost Item Purchased Name
0 22.5 Dog Food Chris
1 2.5 Kitty Litter Kevyn
2 5.0 Bird Seed Vinod
print(df.loc['Store 2'])
print('----')
print(df.iloc[2])
print('----')
print(df.loc['Store 1'] == df.iloc[1])
-----------------------------------------------------------------------
Cost 5
Item Purchased Bird Seed
Name Vinod
Name: Store 2, dtype: object
----
Cost 5
Item Purchased Bird Seed
Name Vinod
Name: Store 2, dtype: object
----
Cost Item Purchased Name
Store 1 False False False
Store 1 True True True
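.loc on a DataFrame also accepts a column name as a second argument, which narrows a row lookup to one column. This is a small sketch that rebuilds the same purchases from plain dictionaries; the variable names are only illustrative.

```python
import pandas as pd

purchases = [{'Name': 'Chris', 'Item Purchased': 'Dog Food', 'Cost': 22.50},
             {'Name': 'Kevyn', 'Item Purchased': 'Kitty Litter', 'Cost': 2.50},
             {'Name': 'Vinod', 'Item Purchased': 'Bird Seed', 'Cost': 5.00}]
df = pd.DataFrame(purchases, index=['Store 1', 'Store 1', 'Store 2'])

store_1_costs = df.loc['Store 1', 'Cost']   # two rows share the label: a Series of two costs
store_2_name = df.loc['Store 2', 'Name']    # one row has the label: a single value
all_costs = df['Cost']                      # a whole column, one entry per row

print(store_1_costs.tolist())
print(store_2_name)
print(all_costs.sum())
```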
-----------------------------------------------------------------------