Dates can be represented in hundreds of different ways, which makes them complicated for both humans and computers to interpret. Give someone who doesn’t live in America a date in US format (mm/dd/yy) and they’ll scratch their head for a while trying to work out what it means. Likewise for an American faced with the format most of the rest of the world uses (dd/mm/yy).

When it comes to CSV files, SQLizer sometimes needs to detect dates and the correct date format. Doing this on a four-line CSV file isn’t too hard, but files containing millions of lines of data are another matter. We never take shortcuts here: we don’t base type decisions on a sample of the data or the first N rows. We check the whole file before deciding which type to use for each column, and SQLizer does a pretty good job of detecting dates and handling them properly.

How do we do it?

Here are a few examples. Consider the following CSV file:

ID, Date Column
1, 01/01/2017
2, 02/01/2017
3, 12/01/2017

It’s clear to any human that the second column should be a date column. But CSV files don’t specify the type of data in each column, whereas databases do. So SQLizer needs to detect which columns should be dates and set up the right column type, which it does:

CREATE TABLE example_dates (
    `ID` INT,
    `Date_Column` DATETIME
);
INSERT INTO example_dates VALUES (1,'2017-01-01 00:00:00');
INSERT INTO example_dates VALUES (2,'2017-01-02 00:00:00');
INSERT INTO example_dates VALUES (3,'2017-01-12 00:00:00');

Notice how the dates in the original file could have been either US dates (mm/dd/yyyy) or the more internationally common dd/mm/yyyy format. Since both fit every value, SQLizer assumed dd/mm/yyyy.

Now consider this CSV file:

ID, Date Column
1, 01/01/2017
2, 02/01/2017
3, 12/01/2017
4, 12/13/2017

We’ve added a fourth row, and that row makes it clear the dates are in US format: mm/dd/yyyy. Feed the file through SQLizer and it works out that all the dates in that column should be US dates.

CREATE TABLE example_dates (
    `ID` INT,
    `Date_Column` DATETIME
);
INSERT INTO example_dates VALUES (1,'2017-01-01 00:00:00');
INSERT INTO example_dates VALUES (2,'2017-02-01 00:00:00');
INSERT INTO example_dates VALUES (3,'2017-12-01 00:00:00');
INSERT INTO example_dates VALUES (4,'2017-12-13 00:00:00');
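One way to implement this kind of detection is candidate elimination. This is a simplified sketch of the idea, not SQLizer’s actual code: start with every date format that could apply, test each value in the column against the surviving candidates, and discard any format that fails. If more than one candidate survives the whole file, prefer dd/mm/yyyy:

```python
from datetime import datetime

def detect_date_format(values):
    """Return the first candidate format that parses every value, or None.

    Candidates are ordered so that dd/mm/yyyy wins when the data is ambiguous.
    """
    candidates = ["%d/%m/%Y", "%m/%d/%Y"]  # dd/mm preferred
    for value in values:
        candidates = [fmt for fmt in candidates
                      if _parses(value.strip(), fmt)]
        if not candidates:
            return None  # no date format fits the whole column
    return candidates[0]

def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

# Ambiguous column: both formats fit every value, so dd/mm/yyyy is chosen.
print(detect_date_format(["01/01/2017", "02/01/2017", "12/01/2017"]))  # %d/%m/%Y

# Adding 12/13/2017 rules out dd/mm/yyyy (there is no 13th month).
print(detect_date_format(["01/01/2017", "02/01/2017", "12/13/2017"]))  # %m/%d/%Y
```

Note that the elimination runs over the whole column, so one unambiguous value late in the file is enough to settle the format for every row.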

A similar thing happens with time formats. Consider the following CSV file:

ID, Date Column
1, 01/01/2017 11:30
2, 02/01/2017 10:20
3, 12/01/2017 12:12

SQLizer assumes the times are in 24-hour format, so 11:30 is read as 11:30 AM and 12:12 as 12:12 PM.

CREATE TABLE example_dates (
    `ID` INT,
    `Date_Column` DATETIME
);
INSERT INTO example_dates VALUES (1,'2017-01-01 11:30:00');
INSERT INTO example_dates VALUES (2,'2017-01-02 10:20:00');
INSERT INTO example_dates VALUES (3,'2017-01-12 12:12:00');

But if we add AM and PM specifiers to those dates…

ID, Date Column
1, 01/01/2017 11:30 PM
2, 02/01/2017 10:20 AM
3, 12/01/2017 12:12 PM

SQLizer picks those up and uses them:

CREATE TABLE example_dates (
    `ID` INT,
    `Date_Column` DATETIME
);
INSERT INTO example_dates VALUES (1,'2017-01-01 23:30:00');
INSERT INTO example_dates VALUES (2,'2017-02-01 10:20:00');
INSERT INTO example_dates VALUES (3,'2017-12-01 12:12:00');
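The same idea extends to time formats. As a hypothetical sketch (the format list and function name are ours, not SQLizer’s): try a 24-hour candidate first, then a 12-hour candidate that expects an AM/PM marker, and use whichever parses:

```python
from datetime import datetime

# Hypothetical candidates: 24-hour first, then 12-hour with an AM/PM marker.
TIME_FORMATS = ["%d/%m/%Y %H:%M", "%d/%m/%Y %I:%M %p"]

def parse_timestamp(value):
    """Parse with the first candidate format that fits, else return None."""
    for fmt in TIME_FORMATS:
        try:
            return datetime.strptime(value.strip(), fmt)
        except ValueError:
            continue
    return None

print(parse_timestamp("01/01/2017 11:30"))     # 24-hour: 2017-01-01 11:30:00
print(parse_timestamp("01/01/2017 11:30 PM"))  # AM/PM honoured: 2017-01-01 23:30:00
```

Because `strptime` refuses to leave trailing text unparsed, a value ending in “PM” fails the 24-hour candidate and falls through to the 12-hour one.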

Finally, having interpreted the second column as dates, if SQLizer spots some invalid data, like this (the 32nd of Undecember):

ID, Date Column
1, 01/01/2017 
2, 02/01/2017 
3, 12/01/2017 
4, 32/13/2017

It will simply assume that they weren’t supposed to be dates after all, and make the second column a text column:

CREATE TABLE example_dates (
    `ID` INT,
    `Date_Column` VARCHAR(11) CHARACTER SET utf8
);
INSERT INTO example_dates VALUES (1,' 01/01/2017');
INSERT INTO example_dates VALUES (2,' 02/01/2017');
INSERT INTO example_dates VALUES (3,' 12/01/2017');
INSERT INTO example_dates VALUES (4,' 32/13/2017');
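Putting it all together, the column-typing decision can be sketched like this (a simplified illustration mirroring the behaviour above; the helper names are ours, not SQLizer’s): if some date format fits every value, emit a DATETIME column; if even one value defeats all the candidates, fall back to a VARCHAR sized to the longest raw value:

```python
from datetime import datetime

def choose_column_type(values, candidates=("%d/%m/%Y", "%m/%d/%Y")):
    """Return a SQL column type: DATETIME if one format fits every value,
    otherwise a VARCHAR wide enough for the longest raw value."""
    fmts = list(candidates)
    for value in values:
        fmts = [f for f in fmts if _parses(value.strip(), f)]
        if not fmts:
            # One bad value (e.g. 32/13/2017) demotes the whole column to text.
            return "VARCHAR(%d)" % max(len(v) for v in values)
    return "DATETIME"

def _parses(value, fmt):
    try:
        datetime.strptime(value, fmt)
        return True
    except ValueError:
        return False

print(choose_column_type(["01/01/2017", "02/01/2017", "12/01/2017"]))  # DATETIME
print(choose_column_type(["01/01/2017", "32/13/2017"]))                # VARCHAR(10)
```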

SQLizer does this for all files, ensuring dates are represented correctly so that the SQL output is correct and ready to be implemented in your database.


SQLizer converts files into SQL databases. With an API to help automate conversion tasks, it’s one of the best ways to transfer data between databases. Convert a file now.