CoreData, iCloud, And “Failure”

CoreData is a very sensitive topic. Here and elsewhere, it’s a recurrent theme. Just last week I had a hair-pulling problem with it that was solved in a ridiculous manner. I’ll document it later for future reference.

This week, triggered by an article on The Verge, the spotlight fell once again on the difficulties of that technology: namely, that it just doesn’t work with iCloud, which by all other accounts works just fine.

It is kind of frustrating (yet completely accurate) to hear from pundits and users that iCloud just works for them for most things, especially Apple’s own products, while CoreData-based apps sync unreliably, if at all. People who haven’t actually tried to make it work tend to assume it’s somehow the developer’s fault for not supporting iCloud properly. Hence the article on The Verge, which points out that it’s not the developer’s fault. The intent is good, but it unfortunately doesn’t solve anything, since it merely wags the finger at Apple without explaining anything.

But what is the actual problem?

CoreData is a framework for storing an application’s data in an efficient (hopefully) and compact way. It was introduced in 2005 with a very simple purpose: stopping developers from storing stuff on the user’s disk in “messy” ways. By providing a framework that helps keep everything tidied up in a single (for the “messy” part) database (for the “efficient” part), Apple essentially presented CoreData as a solution to pretty much every storage ailment that plagued applications: custom file formats that could be ugly and slow, the headache of relationships between parts of documents ending up mangled or inefficient, and so on.

CoreData is a simplification of storage techniques, maintained by Apple and therefore reliable; that is the underlying tenet. And for the most part, it is indeed reliable and efficient.

iCloud, on the other hand, addresses another part of the storage problem: syncing. It is a service/framework meant to make the storage on every device a user owns behave like one shared storage space. Meaning, if I create a file on device A, it is created on B and C as well. If I modify it on C, the modification is echoed on A and B without any user interaction. Behind the scenes, the service keeps track of the modifications in the storage it’s responsible for, pushes them through the network, and, based on the last modification date and some other factors, every device decides which files on disk to replace with the one “in the cloud”. Syncing is a hard problem because of all the edge cases (what if I modified a file on my laptop, then closed it before it sent anything, then made another modification on my iPad? Which version is the right one? Can we merge them safely?), but for small and “atomic” files, it works well enough.

iCloud is a simplification of syncing techniques maintained by Apple, and therefore reliable, to keep the tune playing. And for the most part, it does work as advertised.

But when you mix the two, it doesn’t work.

When you take a look at the goals of the two technologies, you can see why it’s a hard problem to solve: CoreData aims at making a monolithic “store-it-all” file for coherence and efficiency purposes, while iCloud aims at keeping a bunch of files synchronized across multiple disks, merging them if necessary. These two goals, while not completely opposed, are at odds: ideally, iCloud should sync only the difference between two versions of a file.

But with a database file, it’s hard. It’s never a couple of bytes that are modified, it’s the whole coherence tracking metadata, plus all the objects referenced by the actual modification. Basically, if you want to be sure, you’d have to upload and replace the whole database. Because, once again, the goal of CoreData is to be monolithic and self-contained.

The iCloud philosophy would call for incremental changes tracking to be efficient: the original database, then the modification sets, ideally in separate files. The system would then be able to sync “upwards” from any given state to the current one, by playing the sets one by one until it reaches the latest version.

As you can see, a compromise isn’t easily reached. A lot of expert developers I highly respect have imagined ways to make CoreData+iCloud work, and most of them are good ideas. But are they compatible with Apple’s vision of what the user experience should be? Syncing huge, partially modified files isn’t a new problem, and it’s one that none of the version control systems I use has addressed satisfactorily. Most of them just upload the whole thing.

Just my $.02.

  

[CoreData] Duplicating an object

As any of you know, duplicating an object in CoreData is a nightmare: you basically have to start afresh every single time, for each object, iterating over its attributes and relationships.

It so happens I have to do that often in one of my projects. I have to duplicate them except for a couple of attributes and relationships, and there are 20 of each on average (I didn’t come up with the model, OK?).

So, I came up with this code. Feel free to use it, just say hi in the comments, via mail, or any other way if you do!

@implementation NSManagedObject (Duplication)
// Copies every attribute value from source to dest, except the keys in ignore.
// Both objects must belong to the same entity.
+ (BOOL) duplicateAttributeValuesFrom:(NSManagedObject*)source To:(NSManagedObject*)dest ignoringKeys:(NSArray*)ignore {
    if(source == nil || dest == nil) return NO;
    if(![[source entity] isEqual:[dest entity]]) return NO;
 
    for(NSString *attribKey in [[[source entity] attributesByName] allKeys]) {
        if([ignore containsObject:attribKey]) continue;
 
        [dest setValue:[source valueForKey:attribKey] forKey:attribKey];
    }
 
    return YES;
}
 
// Copies every relationship from source to dest, except the keys in ignore.
// The copy is shallow: dest ends up pointing at the same related objects.
+ (BOOL) duplicateRelationshipsFrom:(NSManagedObject*)source To:(NSManagedObject*)dest ignoringKeys:(NSArray*)ignore {
    if(source == nil || dest == nil) return NO;
    if(![[source entity] isEqual:[dest entity]]) return NO;
 
    NSDictionary *relationships = [[source entity] relationshipsByName];
    for(NSString *attribKey in [relationships allKeys]) {
        if([ignore containsObject:attribKey]) continue;
 
        if([((NSRelationshipDescription*)[relationships objectForKey:attribKey]) isToMany]) {
            // To-many: give dest its own NSSet of the same related objects.
            // Ordered to-many relationships (iOS5+) would need an NSOrderedSet here instead.
            [dest setValue:[NSSet setWithSet:[source valueForKey:attribKey]] forKey:attribKey];
 
        } else {
            [dest setValue:[source valueForKey:attribKey] forKey:attribKey];
        }
 
    }
 
    return YES;
}
 
@end
  

[CoreData] Subtleties And Performance

It so happens I got a project to “polish up” that relies heavily on CoreData and has some huge performance issues. I can’t say much about the project, but suffice it to say that a “normal” account on the app has 130+ entities and 250 000 records in the SQLite database, for a grand total of roughly 150MB.

Funnily enough, during the development phase, one of the developers asked some people at Apple directly whether that would turn out to be a problem, and of course they said no, not at all. That made most of the more seasoned developers I asked slap their thighs and laugh.

The problem is basically twofold: on the one hand, the huge number of entities (and their relationships) makes every query non-atomic, requiring a lot of back-and-forth between storage and memory; on the other hand, the huge number of records makes most result sets huge.

So let’s take a few examples of things you should anticipate.

Lots of individual requests with stuff like an ID

Not happening. Ever. You don’t do something like this:

NSMutableArray *results = [NSMutableArray arrayWithCapacity:[fetchedIDs count]];
for(NSNumber *interestingID in fetchedIDs) {
  // One full fetch request per ID: this hits the store every single time.
  NSFetchRequest *fr = [[NSFetchRequest alloc] init];
  [fr setEntity:[NSEntityDescription entityForName:@"Whatever" inManagedObjectContext:AppDelegate.managedObjectContext]];
  [fr setPredicate:[NSPredicate predicateWithFormat:@"id == %@", interestingID]];
  NSArray *localResults = [AppDelegate.managedObjectContext executeFetchRequest:[fr autorelease] error:nil];
  if(localResults.count > 0)
    [results addObjectsFromArray:localResults];
}

Why? Because in the worst-case scenario there are two on-disk accesses for every object you get: one to find the correct row, and one (or several, depending on Apple’s implementation) to de-fault the object (load most of its values into memory). Besides, if you do that pretty much everywhere in your code, you end up bypassing any kind of cache Apple could have set up.

Either implement your own cache (“logical ID” <-> NSManagedObjectID, for instance), or batch-fetch: one request with a predicate like id IN %@ costs a single round trip to the store.

Lots of indirections

Looking for something like company.employees.position == "Developer" to find all the companies that have at least one developer is expensive (and doesn’t actually work as written).

First things first: making it work. What do we want here? All the companies in which at least one of the employees’ position is “Developer”.

Traditionally, this is done with a subquery. A subquery is a way to split your search into stages with as small a performance penalty as possible: basically, you reduce part of a statement to a simple boolean. Here:

(subquery(employees, $x, $x.position == "Developer")).@count > 0

the subquery statement will iterate through employees, find the ones whose position is “Developer”, consolidate the results as an array, and give me a count. If there is 1 or more, the statement is true.

Another way of saying the same thing, in more natural language, would be:

ANY employees.position == "Developer"

which does pretty much the same thing. Performance-wise, it feels like the first one is faster, but I guess it all depends on your model, the number of records, your indexes, etc.

Optimize your model for the most common searches

Let’s pretend I have a bunch of products, each with a few requirements, each requirement having a key I’m looking for. Imagine the list of Apple hardware products over the years: each one has a list of internal parts (some of which appear in several products, like a modem, for instance), and each part is available in some countries of the world, but not all.

Now let’s say that, based on this database, you have an entry point by country which displays the Apple products available there (filtered, obviously, by something like “all the parts in it are available in this country”). Every time you open that list, you’ll run a complex query like

"SUBQUERY(parts, $x, SUBQUERY(countries, $y, $y == %@).@count > 0).@count == parts.@count", country (untested)

Just imagine the strain… for every computer, you have to list all the parts, check which are allowed in a particular country, and verify the count matches. That means loading each and every object and relationship just to check availability.

So maybe your model computer⇄part⇄country isn’t ideal after all, for all its simplicity.

Maybe you should’ve set a field with all the country codes in which a computer is available, updating it as you change the parts (in the willSave callback), so that the query could be something like "computer.availableIn CONTAINS %@", @"#fr#" (if availableIn is a string like #fr#us#de#it#, and is obviously indexed), or anything faster with only one indirection.

Kind of conclusion

As with everything else in computer science, the quality of an algorithm unfortunately has to be measured with the worst-case scenario in the back of your mind. It’s all well and good to see in small-scale tests that the whole database can be loaded into RAM, speeding things up a treat, but in real-world situations the worst case is that you’ll have to hit the disk all the time, and on mobile platforms that’s a huge bottleneck. Also, the simulator doesn’t simulate very well, apart from graphically: my iPhone doesn’t have 8 cores and 16GB of RAM. Running basic performance tests on the worst targeted device should be part of the development cycle.

  

[CoreData] Honey, I Shrunk The Integers

Back in the blissful days of iOS4, the size you assigned to integers in your model was simply ignored: in the SQLite backend there are only two sizes anyway, 32 bits or 64 bits. So even if you had Integer16 fields in your model, they were represented as Integer32 internally.

Obviously, that’s a bug: the underlying storage shouldn’t have any impact on the way you use your model. However, since using an Integer16 or an Integer32 in the model made no difference, a more insidious family of bugs was introduced: the “I don’t care what I said in the model, it obviously works” kind of bug.

Fast forward to iOS5. The mapping layer (the class that acts as a converter between the underlying storage and the CoreData stack) now respects the sizes that were set in the model. And the insidious bugs emerge.

A bit of binary folklore for these people who believe an integer is an integer no matter what:

Data in a computer is stored in bits (0-1 value) grouped together in bytes (8 bits). A single byte can have 256 distinct values (usually [0 – 255] or [-128 – 127]). Then it’s power-of-two storage capacities: 2 bytes, 4 bytes, 8 bytes, etc…

Traditionally, 2 bytes is called a half-word, and 4 bytes a word. So you’ll know what they are if the terms seep into the discourse somehow.

2 bytes can take 65 536 values ([0 – 65 535] or [-32 768 – 32 767]); 4 bytes can go much higher ([0 – 4 294 967 295] or [-2 147 483 648 – 2 147 483 647]). If you were playing with computers in the mid-to-late nineties, you must have seen your graphics card offering “256 colors”, “thousands of colors” or “millions of colors”. That came from the fact that one pixel was represented in either 8, 16 or 32 bits.

Now, the peculiar way fixed-width integers work is that values wrap around modulo the width of the type. On one bit, this is given by the fact that:

  • 0 + 1 = 1
  • 1 + 1 = 0

It “loops”: when it reaches the highest possible value, it goes back to the lowest possible value. With an unsigned byte, 255 + 1 = 0; with a signed byte, 127 + 1 = -128. This looping thing is called modulo arithmetic. That’s math. That’s fact. That’s cool.

Anyway, in the old days of iOS4, the CoreData stack could assign a value greater than the theoretical maximum of a field and live peacefully with it. Not only that, but you could read it back from storage as well. You could, in effect, have an Integer16 (see above for min/max values) that would behave as an Integer32 would (ditto).

Interestingly enough, since this caused no obvious concern to the people writing those applications, some apps that worked fine on iOS4 stopped working altogether on iOS5: if you try to read the value 365232 from an Integer16, you get 37552. If your value had any kind of meaning, it’s busted. The most common casualty of this truncation is the widespread habit of using IDs instead of relationships: you load the page with ID x, not the n-th child page.

So, your code doesn’t work anymore. Shame. I had to fix such a thing earlier, and it wasn’t easy to come up with a decent solution: I couldn’t change the model (migrating would copy the truncated, and therefore wrong, values over), and I didn’t have access to, or the luxury of rebuilding, the data used to generate the SQLite database.

The gung-ho approach is actually rather easy: fetch the real value from the SQLite database directly. If your program used to work, then the stored value is still good, right?

So, I migrated

NSPredicate *pred = [NSPredicate predicateWithFormat:@"id = %hu", [theID intValue]];
[fetchRequest setPredicate:pred];
NSError *error = nil;
matches = [ctx executeFetchRequest:fetchRequest error:&error];

to

NSPredicate *pred = [NSPredicate predicateWithFormat:@"id = %hu", [theID intValue]];
[fetchRequest setPredicate:pred];
NSError *error = nil;
matches = [ctx executeFetchRequest:fetchRequest error:&error];
 
// in case for some reason the value was stored improperly
if([matches count] == 0 && otherWayToID.length > 0) { // here, I also had the title of the object I'm looking for
  int realID = -1;
 
  // -path, not -absoluteString: FMDatabase wants a file path, not a file:// URL
  NSString *dbPath = [[[[ctx.persistentStoreCoordinator persistentStores] objectAtIndex:0] URL] path];
  FMDatabase *db = [FMDatabase databaseWithPath:dbPath];
  if (![db open]) {
      NSLog(@"Could not open db.");
  }
 
  FMResultSet *rs = [db executeQuery:@"select * from ZOBJECTS where ZTITLE = ?", otherWayToID];
  while ([rs next]) {
      realID = [rs intForColumn:@"ZID"];
  }
 
  [rs close];
  [db close];
 
  if(realID >= 0) {
      // retry with the full, untruncated value
      pred = [NSPredicate predicateWithFormat:@"id = %u", realID];
      [fetchRequest setPredicate:pred];
      error = nil;
      matches = [ctx executeFetchRequest:fetchRequest error:&error];
  }
}

In this code, I use FMDB, Gus Mueller’s excellent SQLite3 wrapper.
Obviously, you have to adapt the table name (Z<entity> with CoreData), the column name (Z<field> with CoreData), and the type of the value (I went from an unsigned Integer16 to an unsigned Integer32 here).

Luckily for me (it’s still a bug though, I think), CoreData will accept the predicate with the full value, because it more or less just forwards it to the underlying storage mechanism.

Hope this helps someone else!

-nz