[CoreData] Subtleties And Performance

It so happens I got a project to “polish up” that relies heavily on CoreData and has some huge performance issues. I can’t say much about the project, but suffice it to say that a “normal” account on the app has 130+ entities and 250,000 records in the SQLite database, for a grand total of roughly 150MB.

Funnily enough, during the development phase, one of the developers asked some people at Apple directly whether it would turn out to be a problem, and obviously they said no, not at all. It made most of the more seasoned developers I asked slap their thighs and laugh.

The problem is basically twofold: on the one hand, the huge number of entities (and their relationships) makes any query non-atomic, requiring a lot of back-and-forth between storage and memory; on the other hand, the huge number of records makes most result sets very large.

So let’s take a few examples of things you should anticipate.

Lots of individual requests with stuff like an ID

Not happening. Ever. You don’t do something like this:

NSMutableArray *results = [NSMutableArray arrayWithCapacity:[fetchedIDs count]];
for(NSNumber *interestingID in fetchedIDs) {
  // One fetch request, and one trip to the store... for every single ID.
  NSFetchRequest *fr = [[NSFetchRequest alloc] init];
  [fr setEntity:[NSEntityDescription entityForName:@"Whatever" inManagedObjectContext:AppDelegate.managedObjectContext]];
  // "identifier" stands for whatever logical ID attribute the model exposes.
  [fr setPredicate:[NSPredicate predicateWithFormat:@"identifier == %@", interestingID]];
  NSArray *localResults = [AppDelegate.managedObjectContext executeFetchRequest:[fr autorelease] error:nil];
  if(localResults.count > 0)
    [results addObjectsFromArray:localResults];
}

Why? Because in the worst-case scenario there are 2 on-disk search accesses for every object you get: one to find the correct row, and then one (or a bunch, depending on Apple’s implementation) to de-fault the object (load most of its values into memory). Besides, if you do that pretty much everywhere in your code, you end up actually bypassing any kind of cache Apple could have set up.

Either implement your own cache (“logical ID” <-> NSManagedObjectID, for instance), or batch fetch stuff.
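For the batch fetch, a minimal sketch could look like this (reusing the “Whatever” entity from above, and assuming the logical ID lives in an indexed attribute I’ll call identifier; both names are placeholders):

// One predicate, one round trip to the store for the whole batch of IDs.
NSFetchRequest *fr = [[NSFetchRequest alloc] init];
[fr setEntity:[NSEntityDescription entityForName:@"Whatever" inManagedObjectContext:AppDelegate.managedObjectContext]];
[fr setPredicate:[NSPredicate predicateWithFormat:@"identifier IN %@", fetchedIDs]];
// Let CoreData fault the rows in by groups as you iterate the results.
[fr setFetchBatchSize:100];
NSError *error = nil;
NSArray *results = [AppDelegate.managedObjectContext executeFetchRequest:[fr autorelease] error:&error];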

Lots of indirections

Looking for something like company.employees.position == "Developer" to find all the companies that have at least one developer is expensive (and doesn’t actually work as-is).

First things first: making it work. What do we want here? All the companies in which at least one of the employees’ position is “Developer”.

Traditionally, this is done through a subquery. A subquery is a way to break your search into stages with as little performance penalty as possible. Basically, you reduce part of a statement to a simple boolean. Here:

(SUBQUERY(employees, $x, $x.position == "Developer")).@count > 0

the subquery statement will iterate through employees, find the ones that have the “Developer” position, consolidate the results as an array, and give me a count. If that count is 1 or more, the statement is true.

Another way of saying the same thing in more natural language would be:

ANY employees.position == "Developer"

which will do pretty much the same thing. Performance-wise, it feels like the first one is faster, but I guess it all depends on your model, the number of records, your indexes, etc.
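Wired into an actual fetch request, both forms look something like this (assuming a “Company” entity with a to-many employees relationship, to match the example above):

NSFetchRequest *fr = [[NSFetchRequest alloc] init];
[fr setEntity:[NSEntityDescription entityForName:@"Company" inManagedObjectContext:AppDelegate.managedObjectContext]];

// The SUBQUERY form...
[fr setPredicate:[NSPredicate predicateWithFormat:@"SUBQUERY(employees, $x, $x.position == %@).@count > 0", @"Developer"]];

// ...or the ANY form, which reads more naturally:
// [fr setPredicate:[NSPredicate predicateWithFormat:@"ANY employees.position == %@", @"Developer"]];

NSError *error = nil;
NSArray *companies = [AppDelegate.managedObjectContext executeFetchRequest:[fr autorelease] error:&error];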

Optimize your model for the most common searches

Let’s pretend I have a bunch of products that each have a few requirements, each requirement having a key I’m looking for. Imagine the list of Apple hardware products over the years, each one having a list of internal components (some of which might be shared by several products, like a modem, for instance), each component being available in some countries in the world, but not all.

Now let’s say that, based on this database, you have an entry point by country, which displays the Apple products available there (filtered, obviously, by something like “all the parts in it are available in this country”). Every time you open that list, you’ll run a complex query like

"SUBQUERY(parts, $x, SUBQUERY(countries, $y, $y == %@).@count > 0).@count == parts.@count", country (untested)

Just imagine the strain… for every computer, you have to list all the parts that are available in a particular country and check that the count matches the total number of parts. That means loading each and every object and relationship just to check whether it’s available.

So maybe your model computer⇄part⇄country isn’t ideal after all, for all its simplicity.

Maybe you should’ve set up a field with all the country codes in which a computer is available, updating it whenever the parts change (in the willSave callback), so that the query could be something like "computer.availableIn CONTAINS %@", @"#fr#" (if availableIn is a string like #fr#us#de#it#, and is obviously indexed), or anything faster but with only one indirection.
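A rough sketch of maintaining that denormalized field, assuming a Computer NSManagedObject subclass with a to-many "parts" relationship, each part having a to-many "countries" relationship, and each country a "code" string attribute (all of these names are hypothetical):

// Inside the hypothetical Computer subclass: keep the denormalized
// availableIn string in sync every time the object is saved.
- (void)willSave
{
  [super willSave];

  // A computer is available wherever *all* of its parts are available,
  // so intersect the country codes of every part.
  NSMutableSet *codes = nil;
  for(NSManagedObject *part in [self valueForKey:@"parts"]) {
    NSSet *partCodes = [[part valueForKey:@"countries"] valueForKey:@"code"];
    if(codes == nil)
      codes = [NSMutableSet setWithSet:partCodes];
    else
      [codes intersectSet:partCodes];
  }

  // Wrap every code in '#' so that a CONTAINS "#fr#" match is unambiguous.
  NSString *availableIn = @"";
  if(codes.count > 0) {
    NSArray *sorted = [[codes allObjects] sortedArrayUsingSelector:@selector(compare:)];
    availableIn = [NSString stringWithFormat:@"#%@#", [sorted componentsJoinedByString:@"#"]];
  }

  // Only write the value if it actually changed: setting it here re-dirties
  // the object, and willSave would otherwise be called again, forever.
  if(![availableIn isEqualToString:[self valueForKey:@"availableIn"]])
    [self setValue:availableIn forKey:@"availableIn"];
}

At query time, a single indexed string comparison does the job: [NSPredicate predicateWithFormat:@"availableIn CONTAINS %@", @"#fr#"].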

Kind of conclusion

As with everything else in computer science, the quality of an algorithm unfortunately has to be measured with the worst-case scenario in the back of your mind. It’s all well and good to see in small-scale tests that the whole database can be loaded into RAM, speeding things up a treat, but in real-world situations, the worst case is that you’ll have to hit the disk all the time. And on mobile platforms, that’s a huge bottleneck. Also, the simulator is a tool that doesn’t simulate very well, apart from graphically: my iPhone doesn’t have 8 cores and 16GB of RAM. Running basic performance tests on the worst targeted device should be part of the development cycle.
