Python DOM HTML functional
Posted Sun, Jan 13, 2008 in:DOM HTML Progress
Well, I’ve made some progress on the HTML layout engine, but it still isn’t complete enough to run yet.
When I got to the point where I needed to call ViewCSS.getComputedStyle from the DOM, I stopped to actually implement it, and decided that it was a good time to actually see if the DOM HTML code I had written would run.
It didn’t, of course.
So I spent some time fixing all the little bugs here and there, and set up some test code to pull a page from the internet and parse it into a full HTML DOM tree.
Since I’m using pxdom’s parsing functions and pxdom only knows how to parse proper XML, I also run the HTML through the python tidy lib first to ensure that it’s proper XHTML. Without doing that I couldn’t even parse the Google home page.
Here it is if you’d like to check it out. It needs pxdom to work. The parseString function will take a string containing HTML and return an HTMLDocument.
Right now it’s usable for manipulating and traversing the DOM with all the attributes you are used to being able to access from Javascript, and if you combine it together with this simpleget module (requires domhtml and utidy from above), you can use it for some basic web scraping purposes.
Remember, it is basically pre-alpha code, since I haven’t tested everything yet. I might get around to writing up some unit tests at some point, but until then I can’t guarantee that there are no errors.
JavaScript update
I didn’t spend all my time in the last couple of days just fixing up my DOM HTML implementation. I also did some research on JavaScript interpreters, and I think I’ve decided that I’m going to wrap SEE, a JavaScript interpreter written in C, with a Python module and use it for the scripting engine for my browser.
I considered using Spidermonkey, but after looking through the documentation for each, SEE seems like it will be much easier to wrap, and as far as I can tell, it supports JavaScript exceptions in an easier-to-use (and easier-to-integrate-with-Python-exceptions) way than Spidermonkey does.
SEE also handles memory management for you and you can fully separate interpreter instances so you don’t have to worry about thread safety, which is two less things to have to worry about.