Friday, November 28, 2008

Thwarted by lack of speed

I was hoping to announce a cool new app based on Google's App Engine, but unfortunately I have been thwarted by Python's relative lack of speed.

I have started working on a new version of Crunchy that would run as a web app on Google's servers. Whereas the current version of Crunchy fetches existing html pages, processes them and displays them in the browser, this new version would retrieve page content (in reStructuredText format) from Google's datastore, transform it into html, process it to add interactive elements, and then display it.

This new app was going to be usable as a wiki for creating new material. That was my starting point, greatly helped by an already existing wiki example that I adapted to use reStructuredText. When requesting a page, the following was supposed to happen (a rough sketch in code follows the list):

1. reStructuredText content (for the body of the html page) is fetched from the datastore.
2. Said content is transformed into html by docutils.
3. The html content is further processed by a modified "crunchy engine" to add interactive elements.
4. The modified html content is inserted into the page template and made available.
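
For concreteness, here is a minimal sketch of steps 1 and 2, assuming a hypothetical Page model holding the reStructuredText source (the names are mine, not actual Crunchy code):

    from google.appengine.ext import db
    from docutils.core import publish_parts

    class Page(db.Model):              # hypothetical datastore model
        name = db.StringProperty()     # page identifier
        source = db.TextProperty()     # reStructuredText body

    def render_body(name):
        # Step 1: fetch the reST content from the datastore.
        page = Page.gql("WHERE name = :1", name).get()
        # Step 2: transform it into an html fragment with docutils.
        parts = publish_parts(page.source, writer_name='html')
        return parts['html_body']      # steps 3 and 4 would follow here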

The user would then be able to enter some Python code, which could be sent back to the App Engine app using Ajax for processing and for updating the page display.
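
As a rough sketch of that round trip (assuming the webapp framework; the handler name and the bare exec are mine, and no sandboxing is shown):

    import sys
    import StringIO
    from google.appengine.ext import webapp

    class ExecHandler(webapp.RequestHandler):
        # Hypothetical Ajax endpoint: run the submitted code, return its output.
        def post(self):
            code = self.request.get('code')
            buffer = StringIO.StringIO()
            sys.stdout = buffer            # capture anything the code prints
            try:
                exec code in {}            # run in a fresh namespace
            except Exception, e:
                buffer.write(repr(e))
            sys.stdout = sys.__stdout__
            self.response.out.write(buffer.getvalue())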

A normal user would only be able to interact with already existing pages. Only special users ("editors") would have been able to add pages. I was hoping that people teaching Python would be interested in writing doctest-based exercises and that a useful collection could be built up over time.

Unfortunately, this approach cannot work, at least not using Google's App Engine on Google's own servers. :-(

Just playing with small pages, steps 1 and 2 already take long enough that warnings about requests taking too long show up in the logs. I know from experience that step 3 (which I have not yet started to implement/port from the standard Crunchy) can take even longer for reasonably sized pages. So this approach does not appear to be feasible ... which is unfortunate.

I think I will continue to develop this app for local use, and perhaps write a second wiki-based app that would accept html code with no further processing. I could use the first app to create a page, have it processed, and use Firefox's "view source" feature to cut and paste the content into the online app. This would remove the need for any processing of pages on Google's servers; only Python code execution would need to be taken care of. (Of course, a user could still enter a code sample that takes too long to execute and hits Google's time limit ...)

If anyone has a better idea, feel free to leave it as a comment.

5 comments:

Michael Watkins said...

I use reStructuredText for a content management system and found it useful to cache the output from docutils. I retain the reST content in my datastore, write out the post-processed reST data as xhtml (which is also run through tidy), and save that all in a file system cache.

Only the first hit therefore goes through docutils. In my experience that first hit isn't terribly slow even for moderately large documents, but I'm using my own servers, not Google's.

Still, even on old hardware the difference is roughly 7 to 10 requests per second when rendering the document fresh every time, versus 200 requests per second when serving up the cached html fragment.

The nice thing about this solution is that I can delete the cache directory and it all gets rebuilt as the documents are accessed.
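
Roughly, the read path looks like this (a minimal sketch; the names and paths are illustrative, and the tidy pass is omitted):

    import os
    from docutils.core import publish_parts

    CACHE_DIR = 'cache'                    # illustrative location

    def render_cached(name, rest_source):
        path = os.path.join(CACHE_DIR, name + '.html')
        if os.path.exists(path):           # cache hit: skip docutils entirely
            return open(path).read()
        html = publish_parts(rest_source, writer_name='html')['html_body']
        f = open(path, 'w')                # first hit: render and save
        f.write(html)
        f.close()
        return html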

Perhaps pre-rendering and caching can work for part of your need.

André Roberge said...

Michael:

Thanks for your suggestion. I won't be able to use it directly on Google's servers, as I doubt it would get around the time limitation when attempting to build the cache. However, I might try it locally, and perhaps create static files (instead of using a cache) and upload them to the server.

Tony Arkles said...

André,

We use App Engine for a project at work, and I don't think you should discard the idea yet!

We've found that retrieving a URL can be a slow process -- pretty much independent of the size (for reasonably-sized HTML). Based on your description, this won't be happening too often: most of your content will be served to "regular users".

If you're concerned that your urlfetch and then processing is going to take too long for a single request, you can split the task up into two requests. The first would retrieve the raw HTML into the datastore, and the second would do the processing.
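
Something along these lines (a rough sketch using the urlfetch API; the model and handler names are illustrative):

    from google.appengine.api import urlfetch
    from google.appengine.ext import db, webapp

    class RawPage(db.Model):               # illustrative model
        url = db.StringProperty()
        raw = db.TextProperty()

    class FetchHandler(webapp.RequestHandler):
        # First request: just grab the page and store it.
        def post(self):
            url = self.request.get('url')
            result = urlfetch.fetch(url)
            raw = result.content.decode('utf-8')   # assuming utf-8 content
            RawPage(url=url, raw=raw).put()

    class ProcessHandler(webapp.RequestHandler):
        # Second request: process the stored copy; no urlfetch involved.
        def post(self):
            url = self.request.get('url')
            page = RawPage.gql("WHERE url = :1", url).get()
            # ... run the heavier processing on page.raw here ...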

The time limitation is there for you to identify which tasks are heavy CPU users, so that you can optimize them. If these tasks happen infrequently (compared to the total traffic on your site), you should be fine.

Joseph said...

I am a newb to Google App Engine, but could you cache the result of the reST processing in Google's datastore? The impression I got about Google App Engine was that your best bet is to pre-calculate and cache everything.

Anonymous said...

I would suggest a few possibilities:

1. Process and cache when the text is submitted, rather than when it is retrieved (see the sketch after this list).

2. Do the additional processing and url fetching in the browser using Ajax (I recommend the jQuery library, but whatever you prefer really) instead of in App Engine.

3. If you use something simpler than reStructuredText, such as Markdown, you can do the preprocessing in the browser as well, using Showdown: http://attacklab.net/showdown/
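
A minimal sketch of suggestion 1 (the names are illustrative; all the docutils work happens once, at submission time, so serving a page is a plain datastore read):

    from google.appengine.ext import db
    from docutils.core import publish_parts

    class Page(db.Model):
        source = db.TextProperty()     # reStructuredText, as submitted
        rendered = db.TextProperty()   # html, computed once at save time

    def save_page(page):
        # Render on submission; readers never wait on docutils.
        parts = publish_parts(page.source, writer_name='html')
        page.rendered = parts['html_body']
        page.put()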