Why do we need another post cluttering up the Interpipes on how to find a set of documents in a MongoDB collection that contain a non-empty array field? It’s not like we suddenly have a shortage of posts and articles on this topic after all. Well, it maybe a shocking revelation but not all of these posts and Stack Overflow answers are, well, as correct as I’d like them to be.

Go on, what’s wrong with the typical approach then?

When you search for ‘how to find MongoDB documents with non-empty arrays’, you usually end up finding suggestions like db.test.find({ my_huge_array: { $exists: true, $ne: [] } }) or variations thereof. And this approach works most of the time, especially if you have a fairly rigid schema and a little helping of luck. The only problem with this approach is that, well, it’s almost correct but not correct enough to be robust. But, I hear you say you’ve used this find clause in production for years and it just works, what I am talking about?

Well, the issue with this match expression is that there isn’t actually a check in there anywhere that says “verify that ‘my_huge_array’ is really an array”. If you have a fairly rigid schema and basically enforce that my_huge_array is an array either via your application or (better) via MongoDB’s schema validation, well, this works and will generally return what you’re asking for.

The wheels however come off very rapidly if you’re making use of MongoDB’s flexible schema. Let’s imaging that somewhere in your test collection, a document is hiding that looks like this:

  my_huge_array: "I only pretend to be an array"

Will the find expression above result in a match on this document or not?

Take a guess.

Are you sure?

Well, you may have guessed it – the answer is yes, it will match the document containing the snippet above. Why is that?

Let’s analyse the find expression and see what it actually says:

  • Match any document that contains a field ‘my_huge_array’
  • (implicit and)
  • Match any document where ‘my_huge_array’ is not equal to the empty array

Note that it doesn’t say anywhere “and where the type of ‘my_huge_array’ is array’? That’s exactly the problem with this expression and its various permutations – it will match any document that contains the field you’re looking for as long as it’s not an empty array. The field being a non-empty array is just a bonus and maybe Lady Luck smiling on you that day. Which usually doesn’t happen in production and you tend to find out at 3am.

Right, so how do I really search for documents containing a non-empty array in MongoDB?

There are two options to query for a non-empty array and do it correctly. One of them requires MongoDB 3.6 or newer, whereas the other one works from MongoDB 2.2. Not that you want to use MongoDB 2.2 in this day and age. Anyway, I digress – here is the expression that works in all MongoDB versions from 2.2 onwards:

db.test.find({ $or: [{my_huge_array: []}, {"my_huge_array.0" : { $exists: true }}]})

Almost looks like the original query expression, right? The one I said wasn’t correct? So what’s the big difference then? Well, the big difference is that the second part of the expression specifically checks for the presence of an array element at position 0 in the array. That expression is only true when if ‘my_huge_array’ is actually an array and not any other data type, and has at least one element in the array.

OK, are we done, then?

Well, not quite – the expression above has one big downside, namely that it requires a collection scan even if you have a multikey index on the my_huge_array field. MongoDB multikey indexes only record the value of the array element, but not its position in the array. As a result, MongoDB can’t use an index to optimize the query above and you’re in collection scan territory. Which is usually not a pretty place to be in.

There is an alternative that actually does use an index, but you’ll have to use MongoDB 3.6 or newer for it to work. That really shouldn’t be a big a burden given that 3.6 is the oldest still supported release at the time I am writing this. This particular query expression makes use of the $type operator and looks like this:

db.test.find({ my_huge_array: { $type: "array" } })

You can also use $type: 4 if you prefer the numeric type notation. I usually prefer the string notation as I find it more readable.

So it’s all rainbows and dancing unicorns then?

We’re getting close now. The second impression actually does either need some cooperation from the application or the schema, because the simple type check also matches empty arrays. It /only/ matches arrays so we’re already ahead there, though. So right now we’re left with two options:

  • Add our ’empty array’ check back in
  • Don’t store empty arrays in the first place.

I’m pretty much in the second camp that you should avoid storing empty fields (including empty arrays) in MongoDB. I’ll explain in a different post why that is, but the tl;dr answer is because it mostly blows up the document size with something that the MongoDB driver can handle transparently anyway. The only potential exception would be if the schema validation requires that the field is present, but if that’s the case I’d argue that it shouldn’t be an empty or null value either.

Thanks to Asya Kamsky for her comment on a MongoDB server ticket that was invaluable in this deep dive.

Disclaimer: I work for MongoDB as a Consulting Engineer. This is my personal blog and any statements, opinions, suggestions and recommendations are my own and do not reflect opinions of employers past, present and future.

Leave a Reply

This site uses Akismet to reduce spam. Learn how your comment data is processed.