Pandas DataFrames are data structures that hold data in two dimensions, similar to a table in SQL or an Excel spreadsheet, but built for fast, programmatic analysis in Python. They have rows and columns, and both the rows and the columns carry labels.
DataFrames are an invaluable tool that forms the bedrock of machine learning, data science, scientific computing, and virtually every other data-intensive field.
In this Python guide, we'll see how you can create and manipulate data in Pandas DataFrames.
The Basics of Pandas DataFrames
To work with DataFrames, you must first import Pandas, like so:
>>> import pandas as pd
Let's understand the basics of DataFrames with the help of an example. Let's say you want to use Pandas to analyze job candidates. Besides tracking their details, including their names, locations, and ages, you also want to track their score on the company-assigned programming test.
    | name  | city       | age | test-score |
101 | John  | Oslo       | 41  | 88.0       |
102 | Lisa  | Bern       | 28  | 79.0       |
103 | Gene  | Prague     | 33  | 81.0       |
104 | Ling  | Tokyo      | 34  | 80.0       |
105 | Bruce | Manchester | 38  | 68.0       |
106 | Alan  | Copenhagen | 31  | 61.0       |
107 | Chang | Hong Kong  | 37  | 84.0       |
As you can see, the first row in this table has column labels that help you track the details. Also, the first column has row labels, which are numbers. The other cells in the table are filled with other data values.
Now that the context is set and we know what DataFrame we need to create, we can look into how to do it.
There are many ways to approach creating a Pandas DataFrame. The most prominent way of doing it is to supply the labels, data, and other details with a DataFrame constructor.
Several options are also available to you in terms of passing the data. Besides passing the data as a dictionary or Pandas Series instance, you can also pass it as a two-dimensional tuple, NumPy array, or list. You are not limited to these methods – you can use one of the many data types Python provides to pass the data.
Let's say you want to use a dictionary to pass the data. Here's what the code would look like:
>>> data = {
...     'name': ['John', 'Lisa', 'Gene', 'Ling', 'Bruce', 'Alan', 'Chang'],
...     'city': ['Oslo', 'Bern', 'Prague', 'Tokyo',
...              'Manchester', 'Copenhagen', 'Hong Kong'],
...     'age': [41, 28, 33, 34, 38, 31, 37],
...     'test-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0]
... }
>>> row_labels = [101, 102, 103, 104, 105, 106, 107]
The "data" variable is an ordinary Python variable that refers to the dictionary holding your data. The "row_labels" variable does what you expect it to do: it holds the labels of the rows.
Here's how you would create a DataFrame:
>>> df = pd.DataFrame(data=data, index=row_labels)
>>> df
      name        city  age  test-score
101   John        Oslo   41        88.0
102   Lisa        Bern   28        79.0
103   Gene      Prague   33        81.0
104   Ling       Tokyo   34        80.0
105  Bruce  Manchester   38        68.0
106   Alan  Copenhagen   31        61.0
107  Chang   Hong Kong   37        84.0
With this, the DataFrame has been created. The "df" variable holds the reference to the DataFrame.
Working with a Pandas DataFrame
Although the one we've created isn't, DataFrames can be quite large. Pandas has two methods, .head() and .tail(), that allow you to see the first and last few rows of a DataFrame.
Here's how you use them:
>>> df.head(n=2)  # Outputs the first 2 rows
>>> df.tail(n=2)  # Outputs the final 2 rows
The "n" value is a parameter that tells the methods the number of rows to show.
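As a small illustration of the "n" parameter, the sketch below rebuilds the candidate DataFrame from earlier and calls both methods. It relies on pandas' documented default of five rows when "n" is omitted; the variable names first_rows and last_two are only illustrative.

```python
import pandas as pd

# Same candidate data as in the example above
data = {
    'name': ['John', 'Lisa', 'Gene', 'Ling', 'Bruce', 'Alan', 'Chang'],
    'city': ['Oslo', 'Bern', 'Prague', 'Tokyo',
             'Manchester', 'Copenhagen', 'Hong Kong'],
    'age': [41, 28, 33, 34, 38, 31, 37],
    'test-score': [88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0],
}
df = pd.DataFrame(data, index=[101, 102, 103, 104, 105, 106, 107])

first_rows = df.head()   # no argument: n defaults to 5
last_two = df.tail(2)    # n can also be passed positionally

print(first_rows)
print(last_two)
```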
It is also possible to access a column in a DataFrame. You can do this in the same way you would derive a value from a dictionary:
>>> cities = df['city']
>>> cities  # Returns the column
This is by far the most convenient method to get a column from a Pandas DataFrame. When the column's name is also a valid Python identifier, there is a second option.
In these cases, you can use the dot notation to access it in the same way you would access a class instance's attribute:
>>> df.city # This will have the same result as the previous code
In the code above, we extract the column corresponding with the "city" label that holds the locations of all the job candidates.
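Dot notation only works for some labels. The sketch below, which uses a cut-down two-column DataFrame as an assumption, checks which column labels are valid Python identifiers with the built-in str.isidentifier() method. A label such as "test-score" contains a hyphen, so it has to be accessed with square brackets.

```python
import pandas as pd

# A small illustrative DataFrame with one identifier-like label and one that is not
df = pd.DataFrame(
    {'city': ['Oslo', 'Bern'], 'test-score': [88.0, 79.0]},
    index=[101, 102],
)

# Which labels could be used with dot notation?
usable_with_dot = {label: label.isidentifier() for label in df.columns}
print(usable_with_dot)

# Both spellings return the same column when the label is a valid identifier
same_column = df.city.equals(df['city'])

# 'test-score' is not an identifier, so bracket access is the way to read it
scores = df['test-score']
print(scores)
```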
Interestingly, every column in a Pandas DataFrame is an instance of pandas.Series. This is a data structure that holds one-dimensional data and its labels.
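To see this for yourself, you can inspect a column's type, labels, and name. This is a minimal sketch on a shortened version of the candidate data; the three-row DataFrame is an assumption made to keep the example short.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Lisa', 'Gene'], 'city': ['Oslo', 'Bern', 'Prague']},
    index=[101, 102, 103],
)

cities = df['city']

print(type(cities))            # the column is a pandas Series
print(cities.index.tolist())   # it shares the DataFrame's row labels
print(cities.name)             # its name is the column label
```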
You can also get individual items from a Series object the same way you would with a dictionary. The idea is to use the label as a key, like so:
>>> cities[102]
To access an entire row, you can use the .loc[] accessor like so:
>>> df.loc[103]
name            Gene
city          Prague
age               33
test-score      81.0
Name: 103, dtype: object
Different Ways of Creating Pandas DataFrames
This section will cover the four most well-known methods of creating DataFrames. But you can find the details of all the various methods of creating DataFrames in Pandas in the official documentation.
Before you try running the code of any of the examples below, you will need to import Pandas along with NumPy:
>>> import numpy as np
>>> import pandas as pd
#1 With Dictionaries
When creating a DataFrame with a dictionary, its keys become the column labels, and the values become the columns. Here's how you would define one:
>>> d = {'x': [1, 2, 3], 'y': np.array([2, 4, 8]), 'z': 100}
>>> pd.DataFrame(d)
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
You can use the columns parameter to control the order of the columns. The index parameter sets the row labels.
>>> pd.DataFrame(d, index=[100, 200, 300], columns=['z', 'y', 'x'])
       z  y  x
100  100  2  1
200  100  4  2
300  100  8  3
#2 With Lists
Pandas also supplies the flexibility of using a list of dictionaries to create a DataFrame:
>>> l = [{'x': 1, 'y': 2, 'z': 100},
...      {'x': 2, 'y': 4, 'z': 100},
...      {'x': 3, 'y': 8, 'z': 100}]
>>> pd.DataFrame(l)
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
Further, you also have the flexibility of using a nested list to create DataFrames. But it's best to explicitly specify the labels of the rows, columns, or both when doing this, like so:
>>> l = [[1, 2, 100],
...      [2, 4, 100],
...      [3, 8, 100]]
>>> pd.DataFrame(l, columns=['x', 'y', 'z'])
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
It's noteworthy that you can also use a list of tuples instead of the nested list we have above.
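A quick way to check the tuple variant is to build the same table both ways and compare the results. The names rows, df_tuples, and df_lists below are only illustrative.

```python
import pandas as pd

# Each tuple is one row of the table
rows = [(1, 2, 100), (2, 4, 100), (3, 8, 100)]

df_tuples = pd.DataFrame(rows, columns=['x', 'y', 'z'])
df_lists = pd.DataFrame([list(r) for r in rows], columns=['x', 'y', 'z'])

print(df_tuples)
print(df_tuples.equals(df_lists))  # the two constructions should match
```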
#3 With NumPy Arrays
In the same way you pass a two-dimensional list to a DataFrame constructor, you can also pass a NumPy array:
>>> arr = np.array([[1, 2, 100],
...                 [2, 4, 100],
...                 [3, 8, 100]])
>>> df_ = pd.DataFrame(arr, columns=['x', 'y', 'z'])
>>> df_
   x  y    z
0  1  2  100
1  2  4  100
2  3  8  100
This example might look the same as the nested list example we discussed, but there is one difference: the optional "copy" parameter matters here.
Depending on your pandas version, the NumPy array's data may not be copied by default, so the DataFrame shares the array's original data. In that case, if you modify the array, the DataFrame will also change:
>>> arr[0, 0] = 1000  # Changing the first item of the array
>>> df_  # Checking if the DataFrame is modified
      x  y    z
0  1000  2  100
1     2  4  100
2     3  8  100
If you want your DataFrame to be created with a copy of the values from arr rather than the array itself, pass "copy=True" to the DataFrame constructor. This way, the DataFrame will remain the same even if arr is modified.
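Here is a minimal sketch of the "copy=True" variant. The name df_copy is an assumption made for this example; the array and column labels are the same as above.

```python
import numpy as np
import pandas as pd

arr = np.array([[1, 2, 100],
                [2, 4, 100],
                [3, 8, 100]])

# copy=True asks pandas to store its own copy of the array's values
df_copy = pd.DataFrame(arr, columns=['x', 'y', 'z'], copy=True)

arr[0, 0] = 1000  # change the original array afterwards

print(arr[0, 0])             # the array itself has changed
print(df_copy.loc[0, 'x'])   # the DataFrame keeps the original value
```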
#4 From Files
Pandas allows you to save and load data and labels to and from various formats, including but not limited to CSV files, JSON files, and SQL databases. To save your DataFrame to a CSV file, you can run the following:
>>> df.to_csv('data.csv')
A CSV file, "data.csv," will appear in your working directory. You can also read a CSV file with data like so:
>>> pd.read_csv('data.csv', index_col=0)
This code will give you the same DataFrame as the first one we created in this post. Bear in mind that the "index_col=0" part of the code indicates that the row labels are located in the first column of the file.
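The effect of "index_col=0" is easier to see with a round trip. The sketch below writes a two-row DataFrame (a shortened stand-in for the candidate table) to a temporary directory, so the file location is an assumption of this example, and then reads it back with and without the parameter.

```python
import os
import tempfile

import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Lisa'], 'test-score': [88.0, 79.0]},
    index=[101, 102],
)

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'data.csv')
    df.to_csv(path)  # row labels are written as the first column

    restored = pd.read_csv(path, index_col=0)  # first column becomes the index
    no_index = pd.read_csv(path)               # first column stays a data column

print(restored)
print(no_index)
```

Without "index_col=0", the row labels come back as an ordinary column and pandas assigns a fresh 0-based index.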
Accessing and Modifying Data in DataFrames
Fetching Data with Accessors
The .loc[] accessor allows you to fetch rows and columns by label, like so:
>>> df.loc[101]
name          John
city          Oslo
age             41
test-score    88.0
Name: 101, dtype: object
You can also use the .iloc[] accessor to retrieve a row or column by its integer position.
>>> df.iloc[0]  # Returns first row
name          John
city          Oslo
age             41
test-score    88.0
Name: 101, dtype: object
Using either of these accessors is helpful in most circumstances. Pandas has two other accessors, .at[] and .iat[], for single values. Because .loc[] and .iloc[] support slicing and NumPy-style indexing, you can also use them to access columns like so:
>>> df.loc[:, 'city']
>>> df.iloc[:, 1]
You can also pass slices, lists, or arrays instead of single indices to retrieve rows and columns:
>>> df.loc[102:106, ['name', 'city']]  # Slice of row labels, list of column labels
>>> df.iloc[1:6, [0, 1]]  # Slice of row positions, list of column positions
As you may know, you can slice lists, tuples, and NumPy arrays with a step to skip values. The nice thing about .iloc[] is that it supports this step syntax too.
>>> df.iloc[1:6:2, 0]
102    Lisa
104    Ling
106    Alan
Name: name, dtype: object
The slice above begins with the second row (position 1), stops before position 6, and takes every second row.
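One detail worth keeping in mind is that label slices with .loc[] include both endpoints, while position slices with .iloc[] exclude the stop position, as ordinary Python slices do. The sketch below shows this on a one-column version of the candidate table; the names by_label and by_position are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Lisa', 'Gene', 'Ling', 'Bruce', 'Alan', 'Chang']},
    index=[101, 102, 103, 104, 105, 106, 107],
)

by_label = df.loc[102:106, 'name']   # label slice: 102 and 106 are both included
by_position = df.iloc[1:6, 0]        # position slice: position 6 is excluded

print(by_label)
print(by_label.equals(by_position))  # both select the rows labelled 102 to 106
```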
You don't necessarily have to use the slicing syntax directly. You can also use Python's built-in slice() class, and pd.IndexSlice[] or numpy.s_[] are available for the same purpose:
>>> df.iloc[slice(1, 6, 2), 0]
>>> df.iloc[np.s_[1:6:2], 0]
>>> df.iloc[pd.IndexSlice[1:6:2], 0]
Though all these approaches accomplish the same thing, you might find it easier to use one of them according to your situation.
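The reason these spellings behave alike is that each of them produces a plain slice object. A short check, with illustrative variable names:

```python
import numpy as np
import pandas as pd

builtin_slice = slice(1, 6, 2)
numpy_slice = np.s_[1:6:2]
pandas_slice = pd.IndexSlice[1:6:2]

print(builtin_slice, numpy_slice, pandas_slice)

# Applied to a list, the slice picks positions 1, 3, and 5
letters = list('abcdefg')
picked = letters[builtin_slice]
print(picked)
```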
If you want to fetch a single value, you can use .at[] and .iat[], like so:
>>> df.at[103, 'name']
>>> df.iat[2, 0]  # Output of both is 'Gene'
Modifying Data with Accessors
You can modify parts of a DataFrame by passing a Python sequence or NumPy array, like so:
>>> df.loc[:104, 'test-score'] = [40, 50, 60, 70]  # Modifying the first four items in the test-score column
>>> df.loc[105:, 'test-score'] = 0  # Setting the remaining rows to 0
>>> df['test-score']
101    40.0
102    50.0
103    60.0
104    70.0
105     0.0
106     0.0
107     0.0
Name: test-score, dtype: float64
You can also use negative indices with .iloc[] to modify data:
>>> df.iloc[:, -1] = np.array([88.0, 79.0, 81.0, 80.0, 68.0, 61.0, 84.0])
Inserting and Deleting Data in DataFrames
Modifying Rows
Let's say you want to add a new person to the DataFrame. Here's how you'd do it by creating a new Series object:
>>> rich = pd.Series(data=['Rich', 'Boston', 34, 79],
...                  index=df.columns, name=108)
>>> rich
name            Rich
city          Boston
age               34
test-score        79
Name: 108, dtype: object
>>> rich.name
108
You need "index=df.columns" so that the labels of the Series match the column labels of the df DataFrame.
To add this new candidate to the end of df, older tutorials use .append(), but that method was removed in pandas 2.0. In current versions, you can convert the Series to a one-row DataFrame and use pd.concat() like so:
>>> df = pd.concat([df, rich.to_frame().T])
Because rich holds mixed types, the numeric columns may end up with the object dtype after this step.
If you need to delete this new row later, you can do it with the .drop() method:
>>> df = df.drop(labels=[108])
The .drop() method returns a new DataFrame with the specified rows removed. If you pass "inplace=True" instead, it modifies df directly and returns None.
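The difference between the two calling styles can be sketched as follows. The three-row DataFrame and the label 108 for the extra row are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Lisa', 'Rich']},
    index=[101, 102, 108],
)

# Default: a new DataFrame is returned and df itself is left untouched
trimmed = df.drop(labels=[108])
rows_after_plain_drop = df.index.tolist()

# inplace=True: df is modified directly and the call returns None
result = df.drop(labels=[108], inplace=True)

print(trimmed.index.tolist())
print(rows_after_plain_drop)
print(result, df.index.tolist())
```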
Modifying Columns
Let's say there's a second test, and you need to add every candidate's scores to the table. Here's how you could append a new column with the scores:
>>> df['second-test-score'] = np.array([71.0, 95.0, 88.0, 79.0, 91.0, 91.0, 80.0])
If you've used dictionaries in Python, this method of inserting columns will look familiar. This code adds a new "second-test-score" column to the right of the "test-score" column.
It's also possible for you to assign the same value across the column. All you have to do is run the following:
>>> df['total-score'] = 0.0
Though this insertion method is straightforward, it does not allow you to insert columns at a specific position. If you need to insert the column at a specific position in the table, you can use the .insert() method, like so:
>>> df.insert(loc=4, column='soft-skills-score',
...           value=np.array([86.0, 81.0, 78.0, 88.0, 74.0, 70.0, 81.0]))  # loc is zero-based, so this makes the new column the fifth column in the table
To delete a column from the DataFrame, you can use the del statement just like you would with a Python dictionary.
>>> del df['soft-skills-score']
Another feature of DataFrames that is similar to dictionaries is that DataFrames work with .pop(). This method removes the column you specify and returns it.
In other words, you could also use "df.pop('soft-skills-score')" instead of del.
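A minimal sketch of .pop() in action, using a two-row DataFrame as an assumption to keep the example short. The returned column is kept in a variable named soft, which is only illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {'name': ['John', 'Lisa'], 'soft-skills-score': [86.0, 81.0]},
    index=[101, 102],
)

# pop() removes the column from df and hands it back as a Series
soft = df.pop('soft-skills-score')

print(soft)
print(df.columns.tolist())
```

Unlike del, this lets you keep working with the removed column.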
But you might need to remove more than one column from your table. In this case, you can use the .drop() method. All you need to do is specify the labels of the columns you want to remove.
Additionally, you need to supply the argument "axis=1" when you intend to remove columns. Bear in mind that the method returns a new DataFrame without the specified columns and leaves the original unchanged. If you want to modify the original DataFrame directly, you can pass "inplace=True", in which case the method returns None.
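Removing several columns at once could look like the sketch below. The two-row DataFrame is an assumption for brevity. The "columns" keyword shown in the second call is an alternative spelling that pandas offers in place of "labels" plus "axis=1".

```python
import pandas as pd

df = pd.DataFrame(
    {
        'name': ['John', 'Lisa'],
        'test-score': [88.0, 79.0],
        'second-test-score': [71.0, 95.0],
        'total-score': [0.0, 0.0],
    },
    index=[101, 102],
)

# axis=1 tells drop() that the labels refer to columns, not rows
trimmed = df.drop(labels=['second-test-score', 'total-score'], axis=1)

# Equivalent call using the columns keyword
also_trimmed = df.drop(columns=['second-test-score', 'total-score'])

print(trimmed.columns.tolist())
print(df.columns.tolist())  # the original still has all four columns
```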