May 25, 2006 at 9:05AM Sick SQL: Getting the IDs of all the entries in the past seven days with entries
Here’s a bit of SQL (albeit simplified to remove needless details) that I used to use here to generate the frontpage until its slowness, even using query caching, drove me to simply list the last x entries instead. Now keep in mind that what this does is find all the entries in the last seven days that have had entries posted on them, and those days do not need to be consecutive, which is what makes this query so horrifyingly fascinating in my eyes.
SELECT E2.id FROM entries AS E1, entries AS E2 WHERE E1.published >= E2.published GROUP BY E2.published HAVING COUNT(DISTINCT TO_DAYS(E1.published)) <= 7
Frightened? You should be. Absolutely nothing can be done, best as I can tell, to make this query more efficient, and I haven’t been able to discover an alternative way of doing it that doesn’t entail resorting to cursors. The best I’ve been able to do is make an index that consists of the published and id columns.
Here’s what MySQL has to say about it:
| Select Type | Table | Type | Key | Rows | Extra |
|---|---|---|---|---|---|
| SIMPLE | E1 | index | ix_published | 144 | Using index; Using temporary; Using filesort |
| SIMPLE | E2 | index | ix_published | 144 | Using where; Using index |
Urgh! That’s correlated subquery nasty! Mind that there are 144 rows in the table, so it’s scanning the whole bloody thing! It was ran against an old backup of the DB I use locally for testing.
Now, if you think you’re hard, I challenge you to come up with a better way of doing this. Buggered if I know! There’s no reward here except my undying admiration.
Update (2006-06-08): The published column is a DATETIME, not a DATE, hence the use of TO_DAYS().
1 On May 25, 2006 at 20:39, John Handelaar wrote:
Seems that this is a massively oversimplified example, as it suggests strongly that:
...would give you the exact same result. Which I’m sure isn’t what you’re asking.
2 On May 25, 2006 at 20:40, John Handelaar wrote:
grr hateful autoformatting without warnings.
3 On May 25, 2006 at 23:43, Keith wrote:
Yeah, I changed my formatting code recently and haven’t got around to including a note or a preview button yet as I’m rewriting other chunks of it to include tagging, better caching, merging in my linklog, a comments feed, and a bunch of other things. The hastily written manual for the formatter is here. Sometime I wonder if I’d might as well just get rid of all formatting except for links and emphasis in the comments...
I really, really wish that did have the right effect, but it doesn’t. You see, the days aren’t necessarily contiguous, and that’s the key point here. The first day could be yesterday, the next one last week, and so on. It collects together the entries of the most recent seven days that have entries, ergo the horrible complexity.
Formatting fixed up too.
4 On June 8, 2006 at 15:00, Greg Gaughan wrote:
It doesn’t make sense to
SELECT E2.idif you’re onlyGROUPing byE2.published(which one should it return when there are many?).Also the self-join query looks crazy.
I think you need something like:
SELECT id FROM entries WHERE published IN ( SELECT DISTINCT published FROM entries ORDER BY published DESC LIMIT 7 )5 On June 8, 2006 at 19:06, Keith wrote:
Oh! Somebody with the same surname as me! It’s not often I run across people like that online.
Your query won’t work on MySQL because it doesn’t allow
LIMITin subselects. That’s the reason for the self-join, which I agree does look somewhat crazy. All that complexity is an attempt to simulate what you’re doing with the query you gave.The
SELECT E2.idmight be depending on a happy accident of the way MySQL works. I’ll need to check that out. But I still can’t think of a better way to do this that doesn’t entail breaking it into two queries.6 On June 9, 2006 at 4:36, Rowan Nairn wrote:
Call me crazy, but how about this: First off, make a new indexed DATE column that holds the day of each entry. (Is that cheating?) Then go:
7 On June 9, 2006 at 4:47, Rowan Nairn wrote:
Actually you only need the IF in the outermost subquery, right?
8 On June 9, 2006 at 15:56, Keith wrote:
Thanks for trying, Rowan. That’s close, but no cigar. There’s one problem with it: It’s hardwired for seven days, but the challenge it to solve it for x days. Of course, a way around this is to dynamically construct the query for whatever number of days you want. Your code is essentially an unrolled version of what I was trying to do with the nasty self join.
9 On June 9, 2006 at 18:58, Rowan Nairn wrote:
But mine optimizes to a single ‘range’ select (as far as my limited understanding of MySQL optimization can tell) . Yours doesn’t.
What’s your problem with building the query dynamically?
10 On June 9, 2006 at 19:25, Keith wrote:
That’s precisely the point of the challenge. I know the query I gave was awful. Unrolling it as you did gives the optimiser the necessary hints to use the index properly.
I’ve nothing against dynamic SQL, but prefer not to dynamically build my queries if possible: it usually means more trouble figuring out what the code’s meant to do. In this instance, a dynamic query doesn’t solve the problem because it means introducing a second layer of logic to build the query. The challenge is to solve the general problem using SQL alone. A bit masochistic, I know, but there you go.![[smile]](/images/smilies/smile.gif)
One other thing: rather than
IF(MAX(day) IS NULL, 0, MAX(day)), what you really is theCOALESCE()function. Using that’d simplify the expression down toCOALESCE(MAX(day), 0).11 On June 10, 2006 at 11:01, Greg Gaughan wrote:
(We’re probably related! My grand-parents were from Mayo.)
I wouldn’t use Mysql, of course, but would it allow you to change the subquery into a join, e.g.
SELECT id FROM entries JOIN ( SELECT DISTINCT published FROM entries ORDER BY published DESC LIMIT 7 ) AS dates_with_entries USING (published)and maybe that would allow the
LIMIT. If not, perhaps it will allow you to create a view with aLIMIT? e.g.CREATE VIEW dates_with_entries AS ( SELECT DISTINCT published FROM entries ORDER BY published DESC LIMIT 7 )and then join to that.
12 On June 10, 2006 at 16:07, Keith wrote:
It’s quite likely we’re related, if a little distantly. It’s a Mayo/Sligo name, I’m from Sligo, and my grandfather was from Foxford in Mayo. It’s also not the most common of names.
And we have a winner! That works, though it ought to be kept in mind that your query requires a
DATEcolumn rather than aDATETIMEcolumn.Now, I have to figure out a way of writing that so that it works on MySQL 4.1, which was the current version when my original query was written. But that’s a challenge for myself.