24 Mar 2009

A tutorial on bulkloading data onto App Engine Datastore

A recent application migration to App Engine involves uploading a lot of old data from the SQL Server database into the datastore. The Google documentation only gives some simple examples, and there're two bulkloading tools available causing more trouble than help. I followed with bulkloader first and installed SQLite for it, but failed to make it work. Later on, I tried the bulkload_client tool and succeeded with it.


It is straightforward following the examples to do simple data uploading. You can just dump your database tables into CSV files, write a uploader class in Python to map all fields of a CSV row to datastore.Entity properties, and it should work. However, it's more troublesome if your data contain unicode strings, or if your model requires relationship, or referencing in Datastore terms.


1. A simple example:


Suppose we have a database table MyUser containing an ID as integer, a name as string and a mobile phone number as db.PhoneNumber type. And we have the entity class built like this:


class MyUser(db.Model):
id = db.IntegerProperty()
name = db.StringProperty()
mobile = db.PhoneNumberProperty()


The CSV file contains a number of rows with each row representing one user, and containing three fields separated with commas. For example:


1,David Johnson,31,07897601212
2,Mary Mann,24,0786321234


To upload this data, we need to write a Loader class which, like a Web request handler, stays on the server
and can be activated by a request from the client. It can be as simple as a couple of lines code.


from google.appengine.ext import bulkload
from google.appengine.api import datastore_types

class MyUserLoader(bulkload.Loader):
def __init__(self):
bulkload.Loader.__init__(self, 'MyUser', [('name',str),('mobile',datastore_types.PhoneNumber)])

if __name__=='__main__':
bulkload.main(MyUserLoader)


Our class MyUserLoader is a derived class from bulkload.Loader and the __init__ method (constructor) is responsible for specifying what data to upload into what entity class, in this case 'MyUser'. We call bulkload.main in the main program entry and pass our loader class name as its parameter. The most interesting part is the list that maps a data row in the CSV file. Each field in the list is a tuple of two elements, and the first element is the name of the property that will be put in the entity, and the second element tells the data type, it's actually a function that transforms the string from CSV to the value accepted by that property. In this example, 'name' is a string, so we use str to indicate the Python builtin function
str(x) will be used where x is the first string separated by commas in a CSV line. The second tuple specifies
the property named 'age' accepts an integer which will be converted from the second string in a CSV line
using int(x) function. The third tuple 'mobile' is of type PhoneNumber, and for loader class, we must use
datastore_types.PhoneNumber for db.PhoneNumber type.
Now we can save this class as file myloader.py, upload it to the production server if not running on local machine.


Next, we need to add this to app.yaml file:


-- url: /loadusers
script: myloader.py
login: admin


Note that if you run it on local development machine, comment off the last line login:admin so that you don't
need a credential to run the bulkloader. If on GAE, you can browse to /loadusers, and copy the cookie string
and insert into the bulkload_client command line argument.


Finally, we can run the bulkload_client tool on the command line:


python bulkload_client.py --filename myusers.csv --url http://localhost:8080/loadusers

Uploading will be done with 10 rows each round.


2. Working with datetime type


Now let's add a datetime property to our entity class:


updateTime = db.DateTimeProperty()


It's a bit tricky with date time data values. They can be of different format as output in the CSV file.
For example, we have an ISO format datetime string like this: 2009-01-30 12:24:01. And we must
specify the format to make it parse correctly. We can use Python datetime function strptime.
And let's modify the myloader.py file to add the following:


from datetime import datetime

[('updateTime', lambda x:datetime.strptime(x,'%Y-%m-%d %H:%M:%S')]


We used a lambda inline function to do the conversion from string to datetime object, instead, we can also write a function like this:


def times2val(dt):
return datetime.strptime(dt, ','%Y-%m-%d %H:%M:%S')


Then in the list, we write:


[('updateTime', times2val)]

3. No empty values allowed but None


One problem that I met is that some fields are empty, but the property does not allow empty fields, for example, the PhoneNumber property.



When we have an empty string value for the mobile field, we'll get an exception. Although this property does not accept empty strings as input, it accepts None value. So, to solve this problem, we can define a function to return None if an empty string is given.



[('mobile', lambda x:x != '' or None)]

4. Use different encoding


We have Asian-language names encoded as UTF-8 in the CSV file, and when we upload it, it reports exception using the default encoding. So we must specify the encoding.


We need to do this for both StringProperty and TextProperty properties.


name = db.StringProperty()
bio = db.TextProperty()

lambda s:unicode(s, 'utf-8') or None
lambda s:datastore_types.Text(s, encoding='UTF-8')

5. Build relationship with referencing with a key


Suppose we have another entity called MyPost which has a Reference property to the MyUser entity.


class MyPost(db.Model):
postid = db.IntegerProperty()
title = db.StringProperty()
author = db.ReferenceProperty(MyUser)
content = db.TextProperty()


How should we write a function for reference? First of all, we must make sure that the MyUser entities are
already loaded before we upload the MyPost entities, and we can query the MyUser entities by user id.


u = MyUser.all().filter('uid =',id)


This query is based on index of uid, and we can make it even faster by adding a key for the entities. So let's modify the MyUser loader and load the data again. (We must delete the old data before uploading again, otherwise, the entities will be duplicated.)


Unfortunately, we do not have a straightforward method of specifying a key_name in the loader class. However, we can override a method called HandleEntity in the loader class. A entity parameter is passed over to this method, and the entity contains the fields populated for one row, and this is the place where we can post-process the fields before submitting the entity to datastore. And here can we add the key_name. To do that, we should construct a datastore.Entity instance and add the current entity to it.


def HandleEntity(self, entity):
keyname = 'u%d' % entity['uid']
new_entity = datastore.Entity('MyUser', name=keyname)
new_entity.update( entity )
return new_entity


We set the user id as the key_name, and prepended a letter 'u' because key_names can't start with a number.


Having a key_name, we can use Model.get_by_key_name method to rapidly fetch an entity.


def usertype(uid):
u = MyUser.get_by_key_name('u%d' % uid)
if u is not None:
return u.key()
return None

def tenc(s):
return datastore_types.Text(s, encoding='UTF-8') or None

fields = [('postid', int),('title', str),('author',usertype),('content',tenc)]

class MyPostLoader(bulkload.Loader):
def __init__(self):
bulkload.Loader.__init__(self, 'MyPost', fields)

if __name__=='__main__':
mulkload.main(MyPostLoader)


In app.yaml:


-- url: /loadposts
script: mypostloader.py
login: admin


There seems to be no way of accessing other field values inside the type function. For example, a multiple field primary key cannot be processed in this way. It should be preprocessed so that the key is already combined into one string.

No comments:

Post a Comment