How to Concatenate Data Frames in Pandas (Fast and Easy)

The pandas merge() and .join() methods combine datasets by matching rows on shared keys or index labels, so rows from the original datasets end up side by side in the result.

Depending on the join type, though, the resulting dataset may not contain every row from the parent datasets: rows without a match in the other dataset can be dropped.

Pandas offers a third way to work with datasets: the concat() function. It stitches together the data frames along a row or a column axis. In this guide, we’ll walk you through how to use the function to concatenate data frames.

How to Concatenate Data Frames in Pandas

To concatenate data frames is to append the second one after the first one. Doing this with the concat() function is as simple as passing it a list of the data frames (here, two existing data frames named df1 and df2), like so:

concatenation = pandas.concat([df1, df2])

Bear in mind that the code above assumes that the names of the columns in both data frames are the same. It concatenates the data frames along the rows.

If the column names differ and you concatenate along the rows (axis 0), the result contains the union of all the columns by default, and pandas fills the cells that have no data with NaN.
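
To make the NaN filling concrete, here is a minimal sketch. The two small data frames and their column names are made up for illustration; they share only the "name" column.

```python
import pandas as pd

# Hypothetical frames that share only the "name" column.
df1 = pd.DataFrame({"name": ["A", "B"], "science": [70, 80]})
df2 = pd.DataFrame({"name": ["C", "D"], "history": [65, 90]})

# Row-wise concatenation (axis=0 is the default).
combined = pd.concat([df1, df2])

# "science" is NaN for the rows that came from df2,
# and "history" is NaN for the rows that came from df1.
print(combined)
```

Note that the original row labels (0 and 1) are kept, so they appear twice in the result.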

But what if you want to perform concatenation along the columns? The concat() function allows you to pull this off just as easily.

You can call the function just like you did in the example above – the only difference being that you have to pass an “axis” parameter with the value 1 or “columns.”

Here’s what the code would look like:

concatenation = pandas.concat([df1, df2], axis="columns")

In this example, pandas lines the data frames up by their row index labels, so the result is cleanest when both data frames share the same index.

However, when you’re concatenating along the columns and the row labels differ, the result contains the union of all the row labels by default. As you would expect, pandas fills the cells that have no data with NaN.
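
A small sketch of column-wise concatenation with row labels that only partly overlap. The frames and labels below are invented for illustration.

```python
import pandas as pd

# Hypothetical frames whose row labels overlap only on "J" and "K".
left = pd.DataFrame({"science": [72, 56, 91]}, index=["I", "J", "K"])
right = pd.DataFrame({"accounts": [95, 80, 77]}, index=["J", "K", "L"])

# Column-wise concatenation aligns the rows on their index labels.
side_by_side = pd.concat([left, right], axis="columns")

# Row "I" has no "accounts" value and row "L" has no "science" value,
# so those cells are NaN.
print(side_by_side)
```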

In the following sections, we’ll look at performance, the role of the axes, and the parameters available for concat().

Optimizing concat()

As the basics above show, concat() is one of the easier ways of combining data frames. It’s typically used to build a single large dataset on which further operations can be run.

It’s important to remember that, by default, concat() makes a copy of the data being concatenated. Therefore, you must carefully consider whether multiple concat() calls are necessary. Calling it repeatedly, for example inside a loop, copies the accumulated data every time and can slow down the program.

If making several calls cannot be avoided, you can consider setting the copy parameter to False. Whether this helps depends on your pandas version: newer releases that use Copy-on-Write may ignore or deprecate this keyword, so check the documentation for the version you have installed.
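
A common way to keep the number of calls down is to collect the pieces in a list and concatenate once at the end. The sketch below compares the two patterns; the chunks are made-up stand-ins for data that arrives piece by piece.

```python
import pandas as pd

# Hypothetical chunks standing in for data that arrives in pieces.
chunks = [
    pd.DataFrame({"batch": [i, i], "value": [i * 10, i * 10 + 1]})
    for i in range(3)
]

# Slower pattern: each call copies everything accumulated so far.
grown = chunks[0]
for chunk in chunks[1:]:
    grown = pd.concat([grown, chunk], ignore_index=True)

# Usually preferable: gather the pieces first, then concatenate once.
combined = pd.concat(chunks, ignore_index=True)

print(combined.equals(grown))
```

Both approaches produce the same data frame; the difference is only in how much copying happens along the way.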

Roles of Axes in concat()

You now know that you can specify the axis along which you want to concatenate data frames. So, what happens to the other axis when you concatenate along one of them?

By default, concat() takes the set union of the labels on the other axis, so all the data is preserved and any gaps are filled with NaN.

This is the same behavior you get from an outer join with merge() or .join(). You can control it using the join parameter.

The join parameter defaults to outer. However, the inner option is also available, allowing you to perform a set intersection or inner join.

But bear in mind that using an inner join in this way with the concat() function can lead to data loss, for the same reason that data loss occurs with regular inner joins: anything without a match is dropped.

Only the labels that appear in every data frame are preserved. Again, note that the join parameter only instructs pandas about handling the axes you aren’t concatenating along.
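
The difference between the two join options can be sketched as follows. The frames are made up for illustration and share only the "name" column.

```python
import pandas as pd

# Hypothetical frames that share only the "name" column.
df1 = pd.DataFrame({"name": ["A", "B"], "science": [70, 80]})
df2 = pd.DataFrame({"name": ["C", "D"], "history": [65, 90]})

# Default: union of the columns, gaps filled with NaN.
outer = pd.concat([df1, df2])

# Inner: only the columns present in both frames survive.
inner = pd.concat([df1, df2], join="inner")

print(outer.columns.tolist())
print(inner.columns.tolist())
```

All four rows are kept in both cases, because the rows are the axis being concatenated along; join only affects the columns here.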

Noteworthy Parameters for concat()

Here’s a quick look at some of the most useful parameters you can use with concat():

axis: This parameter represents the axis the function will concatenate along. By default, its value is 0, which represents rows. But you can set the value to 1 to concatenate along columns. You can also use the string “index” to represent rows and “columns” to represent columns.
objs: It accepts a list or any sequence of DataFrame or Series objects you want to concatenate. Using a dictionary is also allowed, but if you use a dictionary, pandas will use its keys to create a hierarchical index.
ignore_index: It accepts a Boolean value and is False by default. When set to True, the index values along the concatenation axis are discarded and the result is labeled from 0 to n-1 instead.
keys: It enables you to create a hierarchical index. The most common way of using it is to add an outer index level while keeping the original indices. This way, you can tell which rows come from which data frame.
copy: This parameter controls whether the source data is copied. Its value is True by default, but if it’s set to False, pandas avoids copying the data where it can.
join: The parameter works similarly to the how parameter of merge(), except that it can only take the values inner and outer. By default, its value is outer, which keeps all the data. However, setting it to inner removes the labels that don’t have a match in the other dataset.
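
As a quick illustration of passing a dictionary as objs, here is a minimal sketch. The frames and the keys "term1" and "term2" are invented for the example.

```python
import pandas as pd

# Hypothetical frames and dictionary keys, chosen for illustration.
first = pd.DataFrame({"name": ["I", "J"], "science": [72, 56]})
second = pd.DataFrame({"name": ["M", "N"], "science": [73, 85]})

# The dictionary keys become the outer level of a hierarchical index.
by_term = pd.concat({"term1": first, "term2": second})

print(by_term)

# Select the rows that came from the second frame.
print(by_term.loc["term2"])
```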

If you want to look at an exhaustive list of parameters that work with concat(), you’ll find it in the official pandas documentation.

Tips for Using concat() Correctly

Here are four things to remember when using concat(), along with examples:

#1 Work with Indices and Axes Carefully

Let’s say two data frames are holding the results of exams, like so:

import pandas as pd

# Exam results for the first group of students
firstDataFrame = pd.DataFrame({
    'name': ['I', 'J', 'K', 'L'],
    'science': [72, 56, 91, 83],
    'accounts': [67, 95, 80, 77],
    'psychology': [81, 71, 87, 86]
})

# Exam results for the second group of students
secondDataFrame = pd.DataFrame({
    'name': ['M', 'N', 'O', 'P'],
    'science': [73, 85, 81, 90],
    'accounts': [88, 93, 72, 89],
    'psychology': [75, 83, 74, 87]
})

Now, the easiest way to concatenate with the concat() function is to pass a list of data frames to it. As you know, the function concatenates vertically along axis 0 by default and preserves all indices.

So, using the function for simple concatenation would look like this:

pd.concat([firstDataFrame, secondDataFrame])

You might want to ignore the pre-existing indices. In this case, you can set the ignore_index argument to True. This way, the resultant data frame index will get labeled from 0 to n-1.

So, for this, the concat() function would be called this way:

pd.concat([firstDataFrame, secondDataFrame], ignore_index=True)

You can also choose to concatenate data frames horizontally. This is as simple as setting the axis argument to 1, like so:

pd.concat([firstDataFrame, secondDataFrame], axis=1)
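
One thing to watch for when the data frames share column names, as the exam frames do, is that the horizontal result contains each name twice. A minimal sketch of one way to keep them apart is to label each block with the keys argument. The frames below are shortened, hypothetical versions of the exam data, and the labels "first" and "second" are made up.

```python
import pandas as pd

# Shortened, hypothetical versions of the exam frames.
first = pd.DataFrame({"name": ["I", "J"], "science": [72, 56]})
second = pd.DataFrame({"name": ["M", "N"], "science": [73, 85]})

# Plain horizontal concatenation: each column name appears twice.
wide = pd.concat([first, second], axis=1)
print(wide.columns.tolist())

# With keys, each block of columns gets its own outer label.
labelled = pd.concat([first, second], axis=1, keys=["first", "second"])
print(labelled["second"])
```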

#2 Avoid Duplicating Indices

As mentioned earlier, the concat() function saves the indices as is. But you might want to verify that the indices that result from pd.concat() don’t overlap.

Thankfully, this is easy to do. You have to set the verify_integrity argument to True; this way, pandas will raise an exception if there are duplicate indices.

Let’s take a look at a relevant example:

try:
    pd.concat([firstDataFrame, secondDataFrame], verify_integrity=True)
except ValueError as e:
    print('ValueError', e)

This prints something like the following (the exact wording depends on the pandas version):

ValueError: Overlapping indices: Int64Index([0, 1, 2, 3], dtype='int64')
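
If you would rather inspect the result than catch an exception, you can also check the index after concatenating. A minimal sketch, using shortened, hypothetical versions of the exam frames:

```python
import pandas as pd

# Shortened, hypothetical versions of the exam frames.
first = pd.DataFrame({"name": ["I", "J"]})
second = pd.DataFrame({"name": ["M", "N"]})

stacked = pd.concat([first, second])

# is_unique is False when any index label repeats.
has_duplicates = not stacked.index.is_unique
print(has_duplicates)

# duplicated() marks every repeat of a label after its first occurrence.
print(stacked.index.duplicated())

# ignore_index=True renumbers the rows, which removes the overlap.
renumbered = pd.concat([first, second], ignore_index=True)
print(renumbered.index.is_unique)
```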

#3 Use Hierarchical Indices for Easier Data Analysis

Adding a multi-level index into a concatenated data frame makes data analysis easier. Continuing with the exam result data frame example, we could add Semester 1 and Semester 2 indices in firstDataFrame and secondDataFrame, respectively.

Pulling this off is as easy as using the keys argument:

res = pd.concat([firstDataFrame, secondDataFrame], keys=['Semester 1','Semester 2'])
res

To access a particular group of values, you can use:

res.loc['Semester 1']

You can also add names to the hierarchical index using the names argument. Let’s look at how you could add the name “Class” to the index we created above:

pd.concat(
    [firstDataFrame, secondDataFrame], 
    keys=['Semester 1', 'Semester 2'],
    names=['Class', None],
)

It is also possible to move a level of the hierarchical index into a regular data column. To do this, you can use the reset_index() method like so:

pd.concat(
    [firstDataFrame, secondDataFrame], 
    keys=['Semester 1', 'Semester 2'],
    names=['Class', None],
).reset_index(level=0)

#4 Consider Matching and Sorting Columns

One of the nice things about the concat() function is that it matches columns by name, so the data frames don’t need to list their columns in the same order. By default, the result keeps the column order of the first data frame.

To sort the columns alphabetically, you can set the sort argument to True, like so:

pd.concat([firstDataFrame, secondDataFrame], sort=True)

You can also apply a custom column order by indexing the result with a list of column names, like this:

custom_sort = ['science', 'accounts', 'psychology', 'name']
res = pd.concat([firstDataFrame, secondDataFrame])
res[custom_sort]