How to add metadata to a Pandas dataframe: a simple approach


Assume the following scenario: you have done a parameter study and collected the results in a pandas dataframe. You want to save this dataset along with some additional information, e.g. the commit ID of the code that was used to obtain the results. In this case the commit ID is part of the metadata you want to save along with your data. A very simple solution for this task is to use the CSV file format in combination with comments.

When you read a dataframe from a CSV file you can specify the optional argument comment, which must be a single character such as '#'. If a line starts with that character, pandas treats it as a comment line and ignores it completely. If the character appears later in a line, the rest of that line is ignored, so choose a character that does not occur in your data. This gives you the option to write additional information in the same file as your data, while the file format remains plain text rather than something more complex. The steps and code snippets for writing and reading are given below.
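As a minimal sketch of this behaviour, the snippet below reads a small in-memory CSV text instead of a real file. The content of the text, including the commit value abc123, is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file content: one metadata line followed by ordinary CSV data.
text = "# commit : abc123\nx,y\n1,2\n3,4\n"

# The line starting with '#' is skipped, so "x,y" becomes the header row.
df = pd.read_csv(io.StringIO(text), comment="#")
print(df)
```

The metadata line never reaches the dataframe; only the two data rows are parsed.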

Write dataframe along with metadata:

I opted for the metadata appearing before the actual data since the actual data potentially consists of a lot more lines. In this case the first step is to write the metadata to a file:

def prepend_metadata(self, file_name):
	"""Prepend metadata of the dataframe to the output file.

	Uses '#' as comment indicating character. Currently, only the number of columns
	containing the multiindex is written as metadata.
	"""
	n_index_columns = str(len(self.dataframe.index.names))
	metadata = "# n_index_columns : " + n_index_columns + '\n'
	# Mode 'w' creates (or truncates) the file; the data is appended afterwards.
	with open(file_name, 'w') as dataframe_file:
		dataframe_file.write(metadata)

This is a method of a class that holds the dataframe in self.dataframe. The result of this function is a file called file_name which contains a single line, e.g. "# n_index_columns : 4" followed by a newline, for a dataframe with a four-level multiindex.

The actual data then needs to be appended to the file file_name. This is achieved by passing the mode argument to the to_csv(...) method:

	my_dataframe.to_csv(file_name, mode='a')

Setting mode='a' instructs the csv writer to operate in append mode so your metadata is not overwritten.
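The two writing steps can be combined in one standalone function. The following is a sketch, not the class method from above: the name write_with_metadata, the dictionary argument and the example dataframe are assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def write_with_metadata(dataframe, file_name, metadata):
    """Write '# key : value' comment lines, then append the dataframe as CSV."""
    # Step 1: mode 'w' creates the file and writes the metadata lines.
    with open(file_name, "w") as f:
        for key, value in metadata.items():
            f.write(f"# {key} : {value}\n")
    # Step 2: mode 'a' appends the data below the metadata.
    dataframe.to_csv(file_name, mode="a")


# Hypothetical example data with a two-level multiindex.
index = pd.MultiIndex.from_tuples([(1, "a"), (2, "b")], names=["run", "case"])
frame = pd.DataFrame({"result": [0.5, 1.5]}, index=index)

path = os.path.join(tempfile.mkdtemp(), "results.csv")
write_with_metadata(frame, path, {"n_index_columns": frame.index.nlevels})

with open(path) as f:
    written_lines = f.read().splitlines()
print(written_lines)
```

The first entry of written_lines is the metadata line, followed by the CSV header and the data rows.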

Read a dataframe from a CSV file with comments / metadata:

This is done by passing the comment argument to read_csv(...), which is a function of the pandas module itself:

	my_dataframe = pandas.read_csv(file_name, comment='#')

Of course, this reads only the dataframe itself. pandas discards the comment lines, so if you want the metadata back you need your own function that parses them.
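One possible way to read the metadata is sketched below. It assumes that all metadata lines sit at the top of the file and follow the "# key : value" layout used for writing; the function name read_metadata and the example file content are hypothetical.

```python
import os
import tempfile


def read_metadata(file_name, comment="#", separator=":"):
    """Collect leading '# key : value' lines of a file into a dictionary."""
    metadata = {}
    with open(file_name) as f:
        for line in f:
            if not line.startswith(comment):
                break  # assumption: metadata only appears before the data
            key, _, value = line[len(comment):].partition(separator)
            metadata[key.strip()] = value.strip()
    return metadata


# Hypothetical example file with two metadata lines and a little data.
path = os.path.join(tempfile.mkdtemp(), "results.csv")
with open(path, "w") as f:
    f.write("# n_index_columns : 4\n# commit : abc123\na,b\n1,2\n")

metadata = read_metadata(path)
print(metadata)
```

All values come back as strings, so numeric entries such as n_index_columns need an explicit int(...) conversion before use.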

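If you write dataframes with multiindices, this is a convenient way to store the number of columns which are part of the multiindex. When reading the file back, that number tells you which leading columns to pass as index_col to read_csv(...).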
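As a sketch of how the stored number could be used to restore the multiindex, the snippet below parses the first line by hand and passes the leading columns as index_col. The in-memory text stands in for a real file, and its content is made up for illustration.

```python
import io

import pandas as pd

# Hypothetical file content as produced by the writing step.
text = "# n_index_columns : 2\nrun,case,result\n1,a,0.5\n2,b,1.5\n"

# Assumption: the first line has the form "# n_index_columns : <number>".
first_line = text.splitlines()[0]
n_index_columns = int(first_line.split(":")[1])

# Use the first n_index_columns columns as the levels of the multiindex.
df = pd.read_csv(
    io.StringIO(text),
    comment="#",
    index_col=list(range(n_index_columns)),
)
print(df)
```

Without index_col, the former index levels would come back as ordinary data columns.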
