Overview

This story shares my experience troubleshooting a file validation issue and, in the process, learning how operating systems, web browsers, and web frameworks choose and interpret MIME types.

Busy? Skip the story and skim the tldr.

A day interrupted Midday, Slack direct message. A person reports having issues uploading a CSV file to a Drupal site. The site uses the handy, albeit still alpha, tablefield module to parse CSV files into tables.

Next day, morning. Another person reports the same issue. It’s time to round up the usual suspects for questioning.

The usual suspects

I obtain copies of the suspect CSV files and review them for special characters and other syntactical violence. Nothing looks amiss.

I try to upload the files on a local version of the site and receive an error: “Only files with the following extensions are allowed: csv.”

Baffled, I try a different, valid CSV file. Same error.

A module examined

To understand how the validation works, I dig into the tablefield module code. In particular, I track the error message to code from a patch that fixes a separate upload issue.

if ($file_upload->guessClientExtension() != 'csv') {
drupal_set_message(t('Only files with the following extensions are allowed: %files-allowed.', ['%files-allowed' => 'csv']), 'error');
return FALSE;
}

This code raised more questions than answers. When my CSV files were uploaded, what was the file extension that the guessClientExtension method guessed?

I change the conditional statement to output the guessedClientExtension return value. ‘XLS’ is the result.

Since XLS is an Excel file extension, I speculate that the setting which associates CSV files to open with Excel on my machine is somehow at fault. I associate CSV files to open with a text editor instead. Yet, it still fails.

What grimy details does the guessClientExtension hide about determining file types? I dive deeper.

Into the depths

I find the guessClientExtension method implementation in the Symfony framework. The PHPDoc says the function “returns the extension based on the client MIME type.” A brief look at the code confirms that comment.

public function guessClientExtension()
{
$type = $this->getClientMimeType();
$guesser = ExtensionGuesser::getInstance();
return $guesser->guess($type);
}

To understand what MIME type guessClientExtension uses, I change the tablefield module code to output the return value of getClientMimeType from an uploaded CSV. It returns ‘application/vnd.ms-excel’.

Skimming the ExtensionGuesser class betrays the default guesser to be MimeTypeExtensionGuesser. This default guesser maps MIME types to file extensions.

On line 327 of MimeTypeExtensionGuesser, I discover the culprit:

'application/vnd.ms-excel' => 'xls',

Now I understood why the original conditional statement fails, but I don’t yet understand why the MIME type is set to ‘application/vnd.ms-excel’ in the first place.

Ignorance revealed

How do browsers associate MIME Types with content when uploaded? The information is sent via a Content Type header, but what sets the value of that header?

Stack overflow provides some answers. Chrome first checks a hard-coded list, then defers to the system registry to associate a file with a MIME type.

Here is a comment from the Chromium source code:

  // We implement the same algorithm as Mozilla for mapping a file extension to
// a mime type. That is, we first check a hard-coded list (that cannot be
// overridden), and then if not found there, we defer to the system registry.
// Finally, we scan a secondary hard-coded list to catch types that we can
// deduce but that we also want to allow the OS to override.

See the source code for implementation details.

As it happens, Chrome does not ship with detection for CSV files and defers to the registry. With more help from the Chromium source code, I find the Windows registry entry for CSV files.

As expected, the content type for CSV files is set to ‘application/vnd.ms-excel’ in the Windows registry. I change the content type value to ‘text/csv’, test another upload in Drupal, and smile wryly when it validates without issue.

Imperfect fixes for an imperfect process

Changing registry values on every computer seems out of reach. Instead, I write a small patch to the tablefield module that validates files that have a CSV file extension but have the MIME type of ‘application/vnd.ms-excel’. Not elegant, but it solves the issue for now.

A note for a future me

Although RFC 4180 suggests CSV files should have a text/csv MIME type, CSV file upload validation would do better to support a broader set of MIME types.

A Stackoverflow comment suggests a range of MIME types:

text/plain
text/x-csv
application/vnd.ms-excel
application/csv
application/x-csv
text/csv
text/comma-separated-values
text/x-comma-separated-values
text/tab-separated-values

MIME type checking, of course, should be used alongside other validation techiques, such as looking at file contents.

TL;DR

On a Windows machine with Microsoft Excel installed, CSV files may have a MIME type of application/vnd.ms-excel rather than the RFC 4180 suggested type of text/csv. Symfony, in particular, will match that MIME type with an file extension of ‘xls’. That causes issues with file validation.