Python Pandas Library - Manipulation of Data

In this article, we’ll discuss the Python Pandas library. We’ll cover the theory first and then work through some of its functions with Python code examples.

What is Python Pandas?

Pandas is a very popular open-source Python library for manipulating, organizing, and analyzing data. It handles loading, cleaning, transforming, and analyzing data efficiently. If you want to work with structured data (tables, spreadsheets), Pandas is a good fit because it provides powerful data structures and the functions to work with them. You can also use it for statistical analysis, data preprocessing, and data exploration.

It also offers a large number of data manipulation methods, such as grouping, filtering, and many more.
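As a quick taste of filtering and grouping, here is a minimal sketch. The `sales` table and its column names are made up for illustration and are not part of any real dataset.

```python
import pandas as pd

# Hypothetical sample data, invented for this illustration
sales = pd.DataFrame({
    "city": ["Havelian", "Abbottabad", "Havelian", "Abbottabad"],
    "amount": [100, 250, 300, 50],
})

# Filtering: keep only the rows where amount is at least 100
large_sales = sales[sales["amount"] >= 100]

# Grouping: total amount per city
totals = sales.groupby("city")["amount"].sum()

print(large_sales)
print(totals)
```

Filtering keeps three of the four rows, and grouping collapses the table to one total per city.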

Data Science and Machine Learning use Python Pandas

Pandas has a reputation for making data-related tasks easy, which is why it is widely used in data science and machine learning. It can also be used for research.

Features of Pandas Library

Intelligent label-based indexing and slicing
High-performance joining and merging of datasets
Data input and output tools for various file formats
Built-in time series functionality
Data alignment and handling of missing values
Flexible reshaping and pivoting of datasets
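Two of these features, merging and pivoting, can be sketched in a few lines. The tables below (`people`, `cities`, `scores`) and their values are hypothetical and exist only to show the calls.

```python
import pandas as pd

# Hypothetical tables, invented for this illustration
people = pd.DataFrame({"name": ["Zeeshan", "Yasir"], "city_id": [1, 2]})
cities = pd.DataFrame({"city_id": [1, 2], "city": ["Havelian", "Abbottabad"]})

# Merging: combine the two tables on their shared key column
merged = people.merge(cities, on="city_id", how="left")
print(merged)

# Pivoting: turn a long table into a wide one (one column per subject)
scores = pd.DataFrame({
    "name": ["Zeeshan", "Zeeshan", "Yasir", "Yasir"],
    "subject": ["math", "physics", "math", "physics"],
    "score": [80, 70, 90, 60],
})
wide = scores.pivot(index="name", columns="subject", values="score")
print(wide)
```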

Importing Pandas

pip install pandas

This is the command we can use to install the pandas library on our system. You can run it from the command prompt (cmd) or from the terminal of your IDE.

import pandas as pd

This is how you import pandas in your file. Here ‘pd’ is a short alias; you can pick any name you like, but ‘pd’ is the common convention.
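One simple way to confirm that the installation and import worked is to print the installed version. The number you see depends on your own setup.

```python
import pandas as pd

# If this prints a version string, pandas is installed and importable
print(pd.__version__)
```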

Data Structures of Pandas

Its 2 main data structures are as follows:

Series (for one-dimensional data)
DataFrame (for two-dimensional data)

1. Series

A Series is a 1D (one-dimensional) array that can store items of different data types. It consists of values and their labels (the index).
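To see the two parts of a Series, here is a small sketch with mixed item types. The variable name `mixed` and its values are only for illustration.

```python
import pandas as pd

# A Series holding an int, a string, and a float
mixed = pd.Series([10, "ten", 10.5], index=["a", "b", "c"])

print(mixed.index.tolist())   # the labels
print(mixed.values)           # the underlying values
print(mixed.dtype)            # object, because the items have different types
```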

Creation of Series

import pandas as pd 

data = [10, 20, 30, 40, 50] 
index = ['A', 'B', 'C', 'D', 'E'] 

series = pd.Series(data, index=index)
print(series)

Output

A    10
B    20
C    30
D    40
E    50
dtype: int64

In this example, we have two Python lists that are used as the values and the labels (index). We then passed both lists to pd.Series, the Series constructor of the Pandas library. The output shows the resulting 1D array.
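Two lists are not the only way to build a Series. As a sketch, a dictionary can supply the labels and values together, and if no index is given pandas assigns default integer labels. The names `marks`, `series_from_dict`, and `default_index` are chosen here for illustration.

```python
import pandas as pd

# Dictionary keys become the labels, dictionary values become the values
marks = {"A": 10, "B": 20, "C": 30}
series_from_dict = pd.Series(marks)
print(series_from_dict)

# Without an index, the labels default to 0, 1, 2, ...
default_index = pd.Series([10, 20, 30])
print(default_index)
```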

Accessing Elements from Series(label and index)

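print( series['C'] )        # access by label
print( series.iloc[3] )     # access by position

Output

30
40

In this code, we first fetched a value using its label. Then we used .iloc to fetch the value at position 3, which is the fourth element because positions start from 0 (0, 1, 2, 3). We use .iloc here because plain series[3] on a Series with string labels is deprecated in recent pandas versions.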
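Beyond single items, a few other access patterns are handy. The sketch below rebuilds the same Series so it runs on its own; the label 'Z' is deliberately one that does not exist.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=["A", "B", "C", "D", "E"])

# Several labels at once: pass a list of labels to .loc
first_and_last = series.loc[["A", "E"]]
print(first_and_last)

# Condition-based access: keep only the values greater than 25
above_25 = series[series > 25]
print(above_25)

# .get returns a default instead of raising an error for a missing label
missing = series.get("Z", 0)
print(missing)
```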

Slicing in Series

print( series['A':'C'] )    # labels
print( series[0:2] )        # indexing

Output

A    10
B    20
C    30
dtype: int64
A    10
B    20
dtype: int64

We can also take a range of values from a Series, using either labels or positions. In the first line, all the values from A to C are returned, because label-based slicing includes the end label. In the second line, the end position (2) is excluded, as in regular Python slicing, so only the values at positions 0 and 1 are returned.
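The difference between the two slicing styles is easier to see with the explicit .loc (labels) and .iloc (positions) accessors. This is a small sketch using the same Series; the variable names are chosen for illustration.

```python
import pandas as pd

series = pd.Series([10, 20, 30, 40, 50], index=["A", "B", "C", "D", "E"])

# Label slice: both 'B' and 'D' are included
by_label = series.loc["B":"D"]

# Position slice: position 4 is excluded, so this covers positions 1, 2, 3
by_position = series.iloc[1:4]

# A step also works, here every second item
every_second = series.iloc[::2]

print(by_label)
print(by_position)
print(every_second)
```

Both `by_label` and `by_position` select the same three items, even though one slice ends at 'D' and the other at 4.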

2. Dataframe

A DataFrame is a 2D (two-dimensional) data structure, like a table, whose columns can have different data types.
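As a quick sketch of "columns of different data types", we can ask a DataFrame for the type of each column. The 'Height' values below are invented for illustration.

```python
import pandas as pd

# Hypothetical data: a text column, an integer column, and a float column
people = pd.DataFrame({
    "Name": ["Zeeshan", "Yasir"],
    "Age": [26, 27],
    "Height": [1.75, 1.80],
})

print(people.dtypes)   # one data type per column
print(people.shape)    # (rows, columns)
```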

Creation of Dataframe

dataVals = {
'Name': ['Zeeshan','Yasir','Usman'],
'Age': [26,27,18],
'City': ['Havelian','Abbottabad','Havelian']
}

df = pd.DataFrame(dataVals)    # creating dataframe
print(df)

Output

      Name  Age        City
0  Zeeshan   26    Havelian
1    Yasir   27  Abbottabad
2    Usman   18    Havelian

In this example, we first created a simple Python dictionary. Then we passed this dictionary to pd.DataFrame. The keys became the column names and each list of values filled its column.
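A dictionary of lists describes the data column by column. If your data comes row by row instead, a list of dictionaries also works. This is a sketch; the names `rows`, `df_from_rows`, and `by_name` are chosen for illustration.

```python
import pandas as pd

# Each dictionary is one row; the keys are the column names
rows = [
    {"Name": "Zeeshan", "Age": 26},
    {"Name": "Yasir", "Age": 27},
]
df_from_rows = pd.DataFrame(rows)
print(df_from_rows)

# Optionally use one of the columns as the row labels
by_name = df_from_rows.set_index("Name")
print(by_name)
```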

Accessing values from Python Dataframe

print( df.head() ) 
print( df.tail(2) ) 
print( df[['Age', 'Name']] )
print( df['City'] )

Output

      Name  Age        City
0  Zeeshan   26    Havelian
1    Yasir   27  Abbottabad
2    Usman   18    Havelian

    Name  Age        City
1  Yasir   27  Abbottabad
2  Usman   18    Havelian

   Age     Name
0   26  Zeeshan
1   27    Yasir
2   18    Usman

0      Havelian
1    Abbottabad
2      Havelian
Name: City, dtype: object

The first example shows the first 5 rows by default (our DataFrame has only 3, so all of them are shown); you can pass a different number, for example df.head(2).
The second example shows the bottom 2 rows, as specified.
In the third example, we selected specific columns by passing a list of their names.
In the fourth example, we accessed a single column, which is returned as a Series.
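The examples above select columns. Rows and single cells can be reached with .loc (labels) and .iloc (positions), as in this sketch that rebuilds the same DataFrame so it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zeeshan", "Yasir", "Usman"],
    "Age": [26, 27, 18],
    "City": ["Havelian", "Abbottabad", "Havelian"],
})

first_row = df.iloc[0]                  # first row by position, returned as a Series
city_of_row_1 = df.loc[1, "City"]       # single cell: row label 1, column 'City'
subset = df.loc[0:1, ["Name", "Age"]]   # rows labelled 0 to 1 (inclusive), two columns

print(first_row)
print(city_of_row_1)
print(subset)
```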

Missing Data Handling

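data = {'Name': ['Zeeshan','Yasir',None],
'Age': [None,27,None],
'City': ['Havelian',None,None]}

df = pd.DataFrame(data)

val = 'empty'

# Each method returns a new object; df itself is not changed
dropped = df.dropna()                  # drops rows that have any missing value
print(dropped)

filled = df.fillna(val)                # replaces missing values with val
print(filled)

age_interp = df['Age'].interpolate()   # interpolates missing numeric values
print(age_interp)

Output

Empty DataFrame
Columns: [Name, Age, City]
Index: []
      Name    Age      City
0  Zeeshan  empty  Havelian
1    Yasir   27.0     empty
2    empty  empty     empty
0     NaN
1    27.0
2    27.0
Name: Age, dtype: float64

dropna, fillna, and interpolate each return a new object and leave df unchanged, so we store the result before printing it. Here every row has at least one missing value, so dropna returns an empty DataFrame. fillna replaces every missing value with 'empty'. Interpolation only makes sense for numeric data, so we apply it to the ‘Age’ column: the first value stays NaN because nothing comes before it, and the last value is filled from the last known value (27.0). The exact formatting of the output may differ slightly between pandas versions.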
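Before dropping or filling anything, it often helps to count the missing values first and then target specific columns. This is a minimal sketch on the same data; the fill value 'unknown' is just an example.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zeeshan", "Yasir", None],
    "Age": [None, 27, None],
    "City": ["Havelian", None, None],
})

# Count the missing values in each column
missing_per_column = df.isna().sum()
print(missing_per_column)

# Drop a row only if 'Name' is missing
named_rows = df.dropna(subset=["Name"])
print(named_rows)

# Fill only the chosen columns, using a dictionary of column -> fill value
filled_text = df.fillna({"Name": "unknown", "City": "unknown"})
print(filled_text)
```

Filling with a dictionary leaves the numeric 'Age' column untouched, so it keeps its numeric type.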

Manipulating Data

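df = pd.DataFrame(dataVals)    # recreate the DataFrame from the creation example

dataframe_filtered = df[df['Age'] > 22]        # keep rows where Age is greater than 22
dataframe_sorted = df.sort_values(by='Age')    # sort rows by Age, ascending
print(dataframe_filtered)
print(dataframe_sorted)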

Output

      Name  Age        City
0  Zeeshan   26    Havelian
1    Yasir   27  Abbottabad

      Name  Age        City
2    Usman   18    Havelian
0  Zeeshan   26    Havelian
1    Yasir   27  Abbottabad

In the first example, we apply a condition that keeps only the rows in which ‘Age’ is greater than 22. In the second example, we sort the DataFrame by the ‘Age’ column in ascending order, which is the default.
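Conditions can be combined, and the sort order can be reversed. The sketch below rebuilds the DataFrame so it runs on its own; note that each condition needs its own parentheses when combined with &.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Zeeshan", "Yasir", "Usman"],
    "Age": [26, 27, 18],
    "City": ["Havelian", "Abbottabad", "Havelian"],
})

# Two conditions joined with & (and); use | for "or"
older_in_havelian = df[(df["Age"] > 22) & (df["City"] == "Havelian")]
print(older_in_havelian)

# Sort from highest to lowest Age
oldest_first = df.sort_values(by="Age", ascending=False)
print(oldest_first)
```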

Conclusion

In conclusion, we hope you now understand what the Python pandas library is and how it works. We’ve covered some of the functions that the Pandas library provides. Our other articles discuss the pandas Series and DataFrame in more detail with code examples, so do visit them as well.

Thank you for reading this article.
