// you’re reading...

MySQL

Creating a Date Dimension Stored Procedure

Our data warehouse was getting pretty big. In fact, it was it was getting into the hundreds of gigabytes of fact data. We started looking for ways to save hard drive space. It became apparent that storing redundant varchars, datetimes, enums and other large column types was unnecessary and causing the scale constraints. Part of the solution was to convert datetime fields to dimensional key fields since it’s cheaper to store MEDIUMINTs (3 bytes) than DATETIME (8 bytes).

Think about it 100 million rows * 8 bytes per DATETIME = 762.93MB! We’ve got dozens of fact tables with > 50 million rows. Changing those columns to MEDIUMINT means 286.10MB storage or a savings of almost half a gigabyte. Ooooh, exciting!?!

To do this I had to create a table of dimensional data for date information.

The object is to capture data now to avoid functions, joins and look-ups later.
Here’s the table specs:

CREATE TABLE dim_datetime
(
date_key MEDIUMINT UNSIGNED NOT NULL AUTO_INCREMENT,
date_datetime DATETIME NOT NULL,
day_name ENUM('SUNDAY','MONDAY','TUESDAY','WEDNESDAY','THURSDAY','FRIDAY','SATURDAY'),
month_name ENUM('JANUARY','FEBRUARY','MARCH','APRIL','MAY','JUNE','JULY','AUGUST','SEPTEMBER','OCTOBER','NOVEMBER','DECEMBER'),
quarter ENUM('Q1','Q2','Q3','Q4'),
is_weekend TINYINT UNSIGNED,
is_holiday TINYINT UNSIGNED,
PRIMARY KEY(date_key),
UNIQUE KEY(date_datetime)
);

Next create the stored procedure. Notice that I’m baking in some default holidays for Christmas, New Years and 4th of July. Probably want to add or subtract depending on your use-case.

USE some_database;
DROP PROCEDURE IF EXISTS sproc_populate_dim_datetime;
DELIMITER $$
CREATE PROCEDURE sproc_populate_dim_datetime(start_date DATETIME, hours INT)
BEGIN
DECLARE new_date DATETIME;
DECLARE quarter CHAR(2);
DECLARE counter INT UNSIGNED;
DECLARE day_name VARCHAR(20);
DECLARE month_name VARCHAR(20);
DECLARE is_weekend TINYINT UNSIGNED;
DECLARE is_holiday TINYINT UNSIGNED;
SET counter :=1;
BEGIN
WHILE counter < hours DO
SET new_date := DATE_ADD(start_date, INTERVAL counter HOUR);
SET day_name := DAYNAME(new_date);
SET month_name := MONTHNAME(new_date);
SET quarter := (CASE
WHEN month_name IN ('January','February','March') THEN 'Q1'
WHEN month_name IN ('April','May','June') THEN 'Q2'
WHEN month_name IN ('July','August','September') THEN 'Q3'
WHEN month_name IN ('October','November','December') THEN 'Q4'
END);
SET is_weekend := (CASE WHEN day_name IN ('Saturday','Sunday') THEN 1 ELSE 0 END);
SET is_holiday := (CASE
WHEN (month_name IN ('January') AND DAYOFMONTH(new_date) = 1) THEN 1
WHEN (month_name IN ('July') AND DAYOFMONTH(new_date) = 4) THEN 1
WHEN (month_name IN ('December') AND DAYOFMONTH(new_date) = 25) THEN 1
ELSE 0 END);
INSERT INTO dim_datetime (`date_datetime`,`day_name`,`month_name`,`quarter`,`is_weekend`,`is_holiday`)
VALUES (new_date, day_name, month_name, quarter, is_weekend,is_holiday);
SET counter := counter + 1;
END WHILE;
END;
END$$
DELIMITER ;

Create 25 years worth of hourly keys: start_datetime, years*days*hours. Depending on the # of years you run, you'll probably want to add a day for each leap year. I just stuck with a straight 25 numeric years.

CALL sproc_populate_dim_datetime('2006-01-01 00:00:00',25*365*24);

Now let's inspect our data.
What's our start and end datetime?

SELECT MIN(date_datetime),MAX(date_datetime) FROM dim_datetime;
+---------------------+---------------------+
| MIN(date_datetime)  | MAX(date_datetime)  |
+---------------------+---------------------+
| 2006-01-01 01:00:00 | 2030-12-25 23:00:00 |
+---------------------+---------------------+
1 row in set (0.00 sec)

How many Midnight, Jan 1s do we have?

SELECT COUNT(*) FROM dim_datetime
WHERE MONTH(date_datetime) = 1
AND DAYOFMONTH(date_datetime) = 1
AND HOUR(date_datetime) = 0;
+----------+
| COUNT(*) |
+----------+
|       24 |
+----------+
1 row in set (0.15 sec)

There's your date dimension. If you use this or have some suggestions, drop me a line. Otherwise, enjoy!

Discussion

2 comments for “Creating a Date Dimension Stored Procedure”

  1. Did you take into account leap years? What about daylight savings time?

    Posted by Asaph | September 29, 2010, 9:51 am
  2. It is buried in the article text, but yes, because the range of hours is arbitrary, it is up to the user to factor in leap years.

    Daylight savings doesn’t really factor in, because the key won’t change based on user timezone preference. That part will be application dependent to pick a date_key that fits the ETL job for the fact data creation.

    Posted by Randy Melder | September 29, 2010, 10:19 am

Post a comment

Help support my site and buy a domain name at http://domainsemailhosting.com/