Google App Engine does not provide a straightforward way to upload data to the datastore. There is one article that describes a way to do it: using that method, you can upload a CSV file to your App Engine app by making repeated HTTP requests. While this works for smaller data sets (1,000 records or fewer), it fails for bigger ones. I had to upload over 2 million records for my salary search app, and the program kept failing. Resuming a bulk upload resets the counter, so it starts all over from the beginning and creates duplicate data.
I have modified the bulk upload program to do the following:
- Introduced a parameter called "skip", which skips "N" records when the bulk data uploader resumes.
- The bulk data uploader often fails because of time-out or network issues. Instead of aborting in such a scenario, the program sleeps for one minute and then resumes from where it left off.
- The program writes a progress log to a text file. In case of a failure, the log file records the count at which the program last failed, so you can skip that many records when resuming (a small helper for reading that count back out of the log is sketched after the sample call below).
- Sample call:
bulkload_client_resume.py --filename upload_me.csv --kind Visa --url http://<your app>.appspot.com/load --skip 1000
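If you do not want to dig the last count out of bulkupload.log by hand before restarting, a small helper along the lines below can do it. This is only a sketch based on the "current count" lines that the modified client writes (see the LogMessage calls later in this post); the helper name get_resume_count is mine, not part of the SDK.

import re

def get_resume_count(logpath='bulkupload.log'):
  """Returns the last 'current count' recorded in the upload log, or 0."""
  last = 0
  try:
    for line in open(logpath):
      # Progress lines written by LogMessage end with 'current count <N>'.
      match = re.search(r'current count (\d+)', line)
      if match:
        last = int(match.group(1))
  except IOError:
    pass  # No log file yet; nothing has been uploaded so far.
  return last

if __name__ == '__main__':
  print 'Resume the upload with --skip %d' % get_resume_count()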
In order to resume the bulk data uploader, you have to modify the following source files in your SDK:
- bulkload_client.py under Google\google_appengine\tools
  - Make a copy of the existing file and create a new file with the name bulkload_client_resume.py
  - Change the following line of code in bulkload_client_resume.py:
    BULKLOAD_CLIENT_PATH = 'google/appengine/tools/bulkload_client_resume.py'
- bulkload_client.py under Google\google_appengine\google\appengine\tools
  - Make a copy of the existing file and create a new file with the name bulkload_client_resume.py
  - Modify the ImportCSV function:
    - Introduce a new parameter called skip, which skips "N" records before resuming the upload:
"""Imports CSV data using a series of HTTP posts.
filename: File on disk containing CSV data.
post_url: URL to post the Entity data to.
cookie: Full cookie header to use while connecting.
batch_size: Maximum number of Entity objects to post with each request.
kind: Entity kind of the objects being posted.
split_url, openfile, create_content_generator, post_entities: Used for
True if all entities were imported successfully; False otherwise.
host_port, uri = split_url(post_url)
csv_file = openfile(filename, 'r')
content_gen = create_content_generator(csv_file, batch_size)
logging.info('Starting import; maximum %d entities per post', batch_size)
for num_entities, content in content_gen:
#logic to skip
logging.info('Skipping entities in %d skip %d', i, int(skip))
logging.info('Importing %d entities in %d bytes current count %d',
LogMessage('Importing entities in bytes current count '+str(i) +'\n')
LogMessage(time.ctime(time.time()) +' :Importing entities in bytes current count '+str(i) +'\n')
content = post_entities(host_port, uri, cookie, kind, content)
except PostError, e:
logging.error('An error occurred while importing: %s', e)
logging.error('Going to sleep')
LogMessage(time.ctime(time.time()) + ' ERROR Count:'+str(i))
#Sleeps for 1 min before retrying the same record
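The changes above do not show how the new --skip value travels from the command line into ImportCSV. The snippet below is a minimal sketch of that wiring, assuming getopt-style flag parsing; the main shown here is illustrative and not a copy of the SDK's actual option handling, so adapt it to whatever bulkload_client_resume.py already does for --filename, --kind and --url.

import getopt
import sys

def main(argv):
  # Illustrative flag parsing inside bulkload_client_resume.py; extend the
  # client's existing option handling rather than replacing it with this.
  opts, _ = getopt.getopt(argv[1:], '',
                          ['filename=', 'kind=', 'url=', 'cookie=', 'skip='])
  options = dict(opts)
  return ImportCSV(filename=options['--filename'],
                   post_url=options['--url'],
                   cookie=options.get('--cookie', ''),
                   batch_size=10,  # batch size hard-coded here for brevity
                   kind=options['--kind'],
                   skip=options.get('--skip', 0))

if __name__ == '__main__':
  main(sys.argv)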
- Introduce a new function, LogMessage, which writes the upload progress to a log file:
logfile = open('bulkupload.log', 'a')
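The single line above is only the core of it; a minimal LogMessage definition along the lines below should work, assuming the function simply appends each message to bulkupload.log (the full body is not shown above, so treat this as a sketch):

def LogMessage(message):
  # Append progress and error messages to a local log file so that a failed
  # run can be resumed with --skip set to the last logged count.
  logfile = open('bulkupload.log', 'a')
  logfile.write(message)
  logfile.close()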
Feel free to drop a note if you have trouble getting this to work.