talideon.com

Do these glasses make me look cool?

May 25, 2006 at 9:05AM Sick SQL: Getting the IDs of all the entries in the past seven days with entries

Here’s a bit of SQL (albeit simplified to remove needless details) that I used to use here to generate the frontpage until its slowness, even using query caching, drove me to simply list the last x entries instead. Now keep in mind that what this does is find all the entries in the last seven days that have had entries posted on them, and those days do not need to be consecutive, which is what makes this query so horrifyingly fascinating in my eyes.

SELECT   E2.id
FROM     entries AS E1, entries AS E2
WHERE    E1.published >= E2.published
GROUP BY E2.published
HAVING   COUNT(DISTINCT TO_DAYS(E1.published)) <= 7

Frightened? You should be. Absolutely nothing can be done, best as I can tell, to make this query more efficient, and I haven’t been able to discover an alternative way of doing it that doesn’t entail resorting to cursors. The best I’ve been able to do is make an index that consists of the published and id columns.

Here’s what MySQL has to say about it:

MySQL’s attempt at execution plan
Select Type Table Type Key Rows Extra
SIMPLE E1 index ix_published 144 Using index; Using temporary; Using filesort
SIMPLE E2 index ix_published 144 Using where; Using index

Urgh! That’s correlated subquery nasty! Mind that there are 144 rows in the table, so it’s scanning the whole bloody thing! It was ran against an old backup of the DB I use locally for testing.

Now, if you think you’re hard, I challenge you to come up with a better way of doing this. Buggered if I know! There’s no reward here except my undying admiration.

Update (2006-06-08): The published column is a DATETIME, not a DATE, hence the use of TO_DAYS().

Technorati Search Technorati Search Irish Bloggers

Comments

1 On May 25, 2006 at 20:39, John Handelaar wrote:

Seems that this is a massively oversimplified example, as it suggests strongly that:

SELECT  id
FROM    entries
WHERE   published > unix_timestamp(now()) - 604800

...would give you the exact same result. Which I’m sure isn’t what you’re asking.

2 On May 25, 2006 at 20:40, John Handelaar wrote:

grr hateful autoformatting without warnings.

3 On May 25, 2006 at 23:43, Keith wrote:

Yeah, I changed my formatting code recently and haven’t got around to including a note or a preview button yet as I’m rewriting other chunks of it to include tagging, better caching, merging in my linklog, a comments feed, and a bunch of other things. The hastily written manual for the formatter is here. Sometime I wonder if I’d might as well just get rid of all formatting except for links and emphasis in the comments...

I really, really wish that did have the right effect, but it doesn’t. You see, the days aren’t necessarily contiguous, and that’s the key point here. The first day could be yesterday, the next one last week, and so on. It collects together the entries of the most recent seven days that have entries, ergo the horrible complexity.

Formatting fixed up too.

4 On June 8, 2006 at 15:00, Greg Gaughan wrote:

It doesn’t make sense to SELECT E2.id if you’re only GROUPing by E2.published (which one should it return when there are many?).

Also the self-join query looks crazy.

I think you need something like:

SELECT  id
FROM    entries
WHERE   published IN (
    SELECT  DISTINCT published
    FROM    entries
    ORDER BY published DESC
    LIMIT 7
)

5 On June 8, 2006 at 19:06, Keith wrote:

Oh! Somebody with the same surname as me! It’s not often I run across people like that online.

Your query won’t work on MySQL because it doesn’t allow LIMIT in subselects. That’s the reason for the self-join, which I agree does look somewhat crazy. All that complexity is an attempt to simulate what you’re doing with the query you gave.

The SELECT E2.id might be depending on a happy accident of the way MySQL works. I’ll need to check that out. But I still can’t think of a better way to do this that doesn’t entail breaking it into two queries.

6 On June 9, 2006 at 4:36, Rowan Nairn wrote:

Call me crazy, but how about this: First off, make a new indexed DATE column that holds the day of each entry. (Is that cheating?) Then go:

SELECT id
FROM entries
WHERE day >= (
	SELECT IF(MAX(day) IS NULL, 0, MAX(day))
	FROM entries
	WHERE day < (
		SELECT IF(MAX(day) IS NULL, 0, MAX(day))
		FROM entries
		WHERE day < (
			SELECT IF(MAX(day) IS NULL, 0, MAX(day))
			FROM entries
			WHERE day < (
				SELECT IF(MAX(day) IS NULL, 0, MAX(day))
				FROM entries
				WHERE day < (
					SELECT IF(MAX(day) IS NULL, 0, MAX(day))
					FROM entries
					WHERE day < (
						SELECT IF(MAX(day) IS NULL, 0, MAX(day))
						FROM entries
						WHERE day < (
							SELECT IF(MAX(day) IS NULL, 0, MAX(day))
							FROM entries
						)
					)
				)
			)
		)
	)
)

7 On June 9, 2006 at 4:47, Rowan Nairn wrote:

Actually you only need the IF in the outermost subquery, right?

8 On June 9, 2006 at 15:56, Keith wrote:

Thanks for trying, Rowan. That’s close, but no cigar. There’s one problem with it: It’s hardwired for seven days, but the challenge it to solve it for x days. Of course, a way around this is to dynamically construct the query for whatever number of days you want. Your code is essentially an unrolled version of what I was trying to do with the nasty self join.

9 On June 9, 2006 at 18:58, Rowan Nairn wrote:

But mine optimizes to a single ‘range’ select (as far as my limited understanding of MySQL optimization can tell) . Yours doesn’t.

What’s your problem with building the query dynamically?

10 On June 9, 2006 at 19:25, Keith wrote:

That’s precisely the point of the challenge. I know the query I gave was awful. Unrolling it as you did gives the optimiser the necessary hints to use the index properly.

I’ve nothing against dynamic SQL, but prefer not to dynamically build my queries if possible: it usually means more trouble figuring out what the code’s meant to do. In this instance, a dynamic query doesn’t solve the problem because it means introducing a second layer of logic to build the query. The challenge is to solve the general problem using SQL alone. A bit masochistic, I know, but there you go. [smile]

One other thing: rather than IF(MAX(day) IS NULL, 0, MAX(day)), what you really is the COALESCE() function. Using that’d simplify the expression down to COALESCE(MAX(day), 0).

11 On June 10, 2006 at 11:01, Greg Gaughan wrote:

(We’re probably related! My grand-parents were from Mayo.)

I wouldn’t use Mysql, of course, but would it allow you to change the subquery into a join, e.g.

SELECT  id
FROM    entries JOIN
(
    SELECT  DISTINCT published
    FROM    entries
    ORDER BY published DESC
    LIMIT 7
) AS dates_with_entries USING (published)

and maybe that would allow the LIMIT. If not, perhaps it will allow you to create a view with a LIMIT? e.g.

CREATE VIEW dates_with_entries AS (
    SELECT  DISTINCT published
    FROM    entries
    ORDER BY published DESC
    LIMIT 7
)

and then join to that.

12 On June 10, 2006 at 16:07, Keith wrote:

We’re probably related! My grand-parents were from Mayo.

It’s quite likely we’re related, if a little distantly. It’s a Mayo/Sligo name, I’m from Sligo, and my grandfather was from Foxford in Mayo. It’s also not the most common of names.

And we have a winner! That works, though it ought to be kept in mind that your query requires a DATE column rather than a DATETIME column.

Now, I have to figure out a way of writing that so that it works on MySQL 4.1, which was the current version when my original query was written. But that’s a challenge for myself.

Post a comment

All form information is optional, but it’s a good idea to fill in your name and email address if you want me to take your comment seriously.

Spammers, don’t bother posting crap down here. The site is set up so that legitimate search engines (Google, for instance) won’t index pages with comments on them. Posting crud here only means you’re wasting my time and patience. Shoo!

Real names, please. Please include!
Won’t be displayed. Please include!
Displayed, if present.