Exploring Pandas Series
Bhaskar S | 12/31/2015 |
Introduction
Pandas is a general-purpose Python extension module for performing data manipulation and data analysis. At the core of Pandas is the support for two data structures: a one-dimensional Series object and a two-dimensional DataFrame object.
The main features of the Pandas library can be summarized as follows:
Support for a one-dimensional array like structure called a Series
Support for a two-dimensional table like structure called a DataFrame
Built-in support for an index on both Series and DataFrame
Automatic alignment on index during operations for both Series and DataFrame
In this article, we will explore the one-dimensional array like data structure called Series.
Installation and Setup
To make it easy and simple, we choose the open-source Anaconda Python distribution, which includes all the necessary Python packages for science, math, & engineering computations as well as statistical data analysis.
Download the Python 3 version of the Anaconda distribution.
Extract the downloaded archive to a directory, say, /home/abc/anaconda3.
Finally, update the PATH environment variable to include /home/abc/anaconda3/bin.
Hands-on with Pandas Series
Open a terminal window and fire off the IPython Notebook.
To begin using Series, one must import the pandas module as shown below:
import numpy as np
import pandas as pd
Notice we have also imported numpy as we will be using it.
To create a simple, fixed-size Series called s1, invoke the Series constructor as shown below:
s1 = pd.Series([10, 20, 30, 40, 50])
The above creates a one-dimensional array like Series object from a list with 5 integer elements.
The following shows the screenshot of the result in IPython:
From Fig.1 above, notice the output consists of two columns: one holds the values from the provided list, while the other holds the default index labels, which are integers starting at 0. Each value has an associated index label.
To access a value in a Series, one needs to use the index label specified in [] brackets. To access the value associated with the index label 3 of the Series called s1, use the syntax as shown below:
s1[3]
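Unless specified otherwise, the default index labels in Pandas are integers that start at 0, so the index label 3 refers to the fourth value.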
To access the values associated with index labels 1 and 3 from the above Series called s1, use the syntax as shown below:
s1[[1,3]]
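As a small sketch of the two access forms above, the following rebuilds s1 and reads one value and then two values. The variable names single, pair, and same_single are ours, for illustration only, and .loc is shown as the explicit label-based equivalent.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])

# A single index label returns the scalar value stored under that label
single = s1[3]

# A list of index labels returns a new Series holding just those entries
pair = s1[[1, 3]]

# .loc performs the same label-based lookup explicitly
same_single = s1.loc[3]

print(single)
print(pair)
```

Note that the list form keeps the index labels 1 and 3 in the result, so the selected values remain tied to their original labels.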
To list all the index labels from the above Series called s1, access the property index as shown below:
s1.index
One can also get the same information about the index labels from the above Series called s1, by invoking the keys() method as shown below:
s1.keys()
To list all the values from the above Series called s1, access the property values as shown below:
s1.values
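The three accessors above can be compared side by side. This is only a sketch; the names labels, same_labels, and values are ours.

```python
import pandas as pd

s1 = pd.Series([10, 20, 30, 40, 50])

labels = list(s1.index)        # default integer labels 0 through 4
same_labels = list(s1.keys())  # keys() returns the index of the Series
values = s1.values             # the values as a NumPy array

print(labels)
print(values)
```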
To create a simple, fixed-size Series called s3 with user specified index labels, invoke the Series constructor as shown below:
s3 = pd.Series([100, 150, 200, 250, 300], index=['A', 'B', 'C', 'D', 'E'])
The following shows the screenshot of the result in IPython:
To access the value associated with index label 'C' from the above Series called s3, use the syntax as shown below:
s3['C']
One can also get the value associated with index label 'C' from the above Series called s3 by invoking the get() method as shown below:
s3.get('C')
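One practical difference between the two forms appears when the label does not exist. The sketch below uses a made-up label 'Z' and a made-up default of -1 to illustrate it.

```python
import pandas as pd

s3 = pd.Series([100, 150, 200, 250, 300], index=['A', 'B', 'C', 'D', 'E'])

found = s3.get('C')         # label exists, so its value is returned
missing = s3.get('Z')       # label absent: get() returns None
fallback = s3.get('Z', -1)  # label absent: the supplied default is returned

# The [] form raises a KeyError for an absent label instead
try:
    s3['Z']
    raised = False
except KeyError:
    raised = True

print(found, missing, fallback, raised)
```

The get() form may therefore be the more convenient choice when a label is not guaranteed to be present.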
To create a simple, fixed-size Series called s5 from a Python dictionary, invoke the Series constructor as shown below:
s5 = pd.Series({'N1': 'Alice', 'N2': 'Bob', 'N3': 'Charlie'})
The following shows the screenshot of the result in IPython:
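To make the dictionary case concrete, the following sketch shows that the dictionary keys become the index labels and the dictionary values become the Series values. The names labels and name are ours.

```python
import pandas as pd

s5 = pd.Series({'N1': 'Alice', 'N2': 'Bob', 'N3': 'Charlie'})

labels = list(s5.index)  # the dictionary keys, used as index labels
name = s5['N2']          # look up a value by its key, as with a dictionary

print(labels)
print(name)
```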
We will now create a simple, fixed-size Series called s6 with 10 integer values and user specified index labels, as shown below:
s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51], index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])
The following shows the screenshot of the result in IPython:
To take a peek at the top few elements from the above Series called s6, invoke the head() method as shown below:
s6.head()
The following shows the screenshot of the result in IPython:
To take a peek at the last few elements from the above Series called s6, invoke the tail() method as shown below:
s6.tail()
The following shows the screenshot of the result in IPython:
To get information about the number of elements from the above Series called s6, access the property size as shown below:
s6.size
To get the number of elements that are not null (NaN) in the above Series called s6, invoke the method count() as shown below:
s6.count()
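Since s6 has no missing values, size and count() return the same number for it. The difference shows up only when NaN is present, as in this sketch with a hypothetical Series called s_nan.

```python
import numpy as np
import pandas as pd

# A hypothetical Series with two missing entries
s_nan = pd.Series([65, np.nan, 81, np.nan, 56])

total = s_nan.size        # counts every element, including NaN
non_null = s_nan.count()  # counts only the elements that are not NaN

print(total, non_null)
```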
To list all the unique elements from the above Series called s6, invoke the method unique() as shown below:
s6.unique()
The following shows the screenshot of the result in IPython:
As can be seen from Fig.4 above, the Series called s6 is sorted by index labels.
To sort by the value elements in the above Series called s6, invoke the method sort_values() as shown below:
s6 = s6.sort_values()
The sort_values() method returns a new Series that is ordered by the values. Older Pandas releases provided this operation as order(), which was deprecated in version 0.17 and later removed.
The following shows the screenshot of the result in IPython:
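A minimal sketch of sorting by value with sort_values(), which is the method available in current Pandas releases. The ascending=False variant and the names by_value and by_value_desc are ours and are not part of the walkthrough.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

# Returns a new Series; s6 itself is left unchanged
by_value = s6.sort_values()

# Pass ascending=False to get the largest values first
by_value_desc = s6.sort_values(ascending=False)

print(by_value)
print(by_value_desc.head(3))
```

Each value keeps its original index label after sorting, which is why the labels appear out of order in the result.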
To list all the unique elements along with their counts from the above Series called s6, invoke the method value_counts() as shown below:
s6.value_counts()
The following shows the screenshot of the result in IPython:
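The relationship between unique() and value_counts() can be sketched as follows, using a fresh copy of s6. The names distinct and counts are ours.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

# unique() returns the distinct values in order of first appearance
distinct = s6.unique()

# value_counts() returns a Series whose index labels are the distinct
# values and whose values are the counts, most frequent first
counts = s6.value_counts()

print(distinct)
print(counts)
```

Here the value 65 occurs three times and every other value occurs once, so 65 is listed first in the counts.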
To re-sort by index labels in the above Series called s6, invoke the method sort_index() as shown below:
s6 = s6.sort_index()
The sort_index() method returns a new Series that is ordered by the index labels.
The following shows the screenshot of the result in IPython:
To look-up the value associated with the index label 'R5' from the above Series called s6, one can use the property loc as shown below:
s6.loc['R5']
One can also look-up the same above value using the in-built position index (which starts with a zero for the first value) from the above Series called s6, using the property iloc as shown below:
s6.iloc[5]
To look-up more than one value, such as the values associated with the index labels 'R1', 'R3', and 'R5' from the above Series called s6, one can use the property loc as shown below:
s6.loc[['R1', 'R3', 'R5']]
One can also look-up the same above values using the in-built position indices from the above Series called s6, using the property iloc as shown below:
s6.iloc[[1, 3, 5]]
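The loc and iloc lookups agree above only because s6 is currently ordered by its index labels. The sketch below, which uses a value-sorted copy that we call by_value, suggests how the two can diverge once the order changes.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

# With the Series in index-label order, label 'R5' sits at position 5
by_label = s6.loc['R5']
by_position = s6.iloc[5]

# After sorting by value, the label still finds the same value,
# but position 5 now holds whatever value sorted into the sixth slot
by_value = s6.sort_values()
label_after_sort = by_value.loc['R5']
position_after_sort = by_value.iloc[5]

print(by_label, by_position)
print(label_after_sort, position_after_sort)
```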
One can use the slicing operation with position indices to select values from a Series.
The general syntax to access the elements from a Series called X is as shown below:
X[start:end:step]
where start is the starting position index, end is one past the last position index we desire (the element at end itself is excluded), and step is the increment to the next position index to access.
To access all the elements 1st through 5th in steps of 2 from the above Series called s6, use the syntax as shown below:
s6[0:5:2]
The following shows the screenshot of the result in IPython:
To access all the elements starting with the 6th from the above Series called s6, use the syntax as shown below:
s6[5:]
The following shows the screenshot of the result in IPython:
To access all the elements starting with the 3rd through last in steps of 2 from the above Series called s6, use the syntax as shown below:
s6[2::2]
The following shows the screenshot of the result in IPython:
To access all the elements starting with the 4th and going backwards to the first from the above Series called s6, use the syntax as shown below:
s6[3::-1]
The following shows the screenshot of the result in IPython:
To access all the elements starting with the last and going backwards to the first in steps of 3 from the above Series called s6, use the syntax as shown below:
s6[::-3]
The following shows the screenshot of the result in IPython:
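The slices above can also be written with the iloc property, which makes the positional intent explicit. The sketch below repeats three of them that way. The final comparison with label slicing through loc is an addition of ours that the walkthrough does not cover: a label slice includes its end label, while a position slice excludes its end position.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

every_other = s6.iloc[0:5:2]   # positions 0, 2, 4
backwards = s6.iloc[3::-1]     # positions 3, 2, 1, 0
back_by_three = s6.iloc[::-3]  # positions 9, 6, 3, 0

# Position slices exclude the end; label slices include it
by_position = s6.iloc[1:3]     # R1, R2
by_label = s6.loc['R1':'R3']   # R1, R2, R3

print(every_other)
print(by_position)
print(by_label)
```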
We will now create two Series called s7 and s8 as shown below:
keys1 = ['AAPL', 'COST', 'FB', 'GOOG', 'MSFT']
vals1 = [108.75, 162.50, 107.25, 776.50, 56.50]
s7 = pd.Series(vals1, index=keys1)
keys2 = ['AAPL', 'FB', 'GOOG', 'MSFT', 'NFLX']
vals2 = [107.25, 105.50, 770.50, 53.25, 118.75]
s8 = pd.Series(vals2, index=keys2)
The following shows the screenshot of the result in IPython:
One can perform basic arithmetic operations such as addition, subtraction, multiplication, or division on the two Series called s7 and s8. The following example is the subtraction operation:
s7 - s8
The following shows the screenshot of the result in IPython:
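The subtraction is a direct illustration of the automatic index alignment mentioned in the introduction: values are paired by index label, not by position, and a label present in only one Series yields NaN. The sketch below reproduces this. The use of sub() with fill_value=0 is one possible way, chosen by us, to treat a missing side as zero.

```python
import pandas as pd

keys1 = ['AAPL', 'COST', 'FB', 'GOOG', 'MSFT']
vals1 = [108.75, 162.50, 107.25, 776.50, 56.50]
s7 = pd.Series(vals1, index=keys1)

keys2 = ['AAPL', 'FB', 'GOOG', 'MSFT', 'NFLX']
vals2 = [107.25, 105.50, 770.50, 53.25, 118.75]
s8 = pd.Series(vals2, index=keys2)

# Labels are matched up; 'COST' and 'NFLX' exist on one side only
diff = s7 - s8

# Keep only the labels that both Series share
common = diff.dropna()

# Alternatively, treat a missing side as 0 before subtracting
filled = s7.sub(s8, fill_value=0)

print(diff)
print(filled)
```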
Here is another example of dividing the Series called s6 by 10:
s6 / 10
The following shows the screenshot of the result in IPython:
One can also perform statistical operations such as finding the mean, the variance, and the standard deviation on the above Series called s6 as shown below:
s6.mean()
s6.var()
s6.std()
The following shows the screenshot of the result in IPython:
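One detail worth knowing when reading these numbers: by default var() and std() compute the sample statistics, dividing by N-1 and not by N. The sketch below checks that against a hand computation. The ddof=0 call and the variable names are ours.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

mean = s6.mean()
sample_var = s6.var()   # default ddof=1: divides by N-1
sample_std = s6.std()   # square root of the sample variance

# Hand computation of the sample variance for comparison
squared_dev = sum((v - mean) ** 2 for v in s6.values)
manual_var = squared_dev / (s6.size - 1)

# ddof=0 gives the population variance (divides by N)
population_var = s6.var(ddof=0)

print(mean, sample_var, sample_std, population_var)
```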
One can filter and select from a given Series using logical expressions. For example, one could select all the values that are greater than 65 from the above Series called s6 as shown below:
s6[s6 > 65]
As another example, one could select all the values that are greater than 61 and less than 80 from the above Series called s6 as shown below:
s6[(s6 > 61) & (s6 < 80)]
The following shows the screenshot of the above two results in IPython:
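Behind the scenes, a logical expression such as s6 > 65 produces a Series of True/False values with the same index labels, and that boolean Series is what selects the elements. The sketch below makes the intermediate step visible; the names mask, above, and between are ours.

```python
import pandas as pd

s6 = pd.Series([65, 72, 65, 81, 56, 83, 61, 78, 65, 51],
               index=['R0', 'R1', 'R2', 'R3', 'R4', 'R5', 'R6', 'R7', 'R8', 'R9'])

# The comparison yields a boolean Series aligned with s6
mask = s6 > 65

# Indexing with the mask keeps only the elements where it is True
above = s6[mask]

# Each condition needs its own parentheses, because & binds more
# tightly than the comparison operators
between = s6[(s6 > 61) & (s6 < 80)]

print(mask)
print(above)
print(between)
```

Note that both comparisons are strict, so the value 61 itself is not included in the second result.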
References