Thursday, January 29, 2009

remarks for Thursday 1/29

As usual, I had a lot more slides to show on Tuesday than there was time for, so I have a lot to discuss today.

You should have read Chapter One of the MRS textbook. (MRS is the acronym made from the authors' surnames.)

The first phase of the term project has been made available on the web site, through the course home page.

I'll need to talk about the syntax of HTML at some length, and relate that to the concept of tokenization. These concepts were probably covered in 331, if you've already taken it.

Anyway, there are lots of ways to tokenize an HTML file. You can write custom code in the language of your choice, of course. If you go this route, you might consider a finite state machine.
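To make the finite state machine idea concrete, here is a minimal sketch in Python. It assumes just two states (inside a tag, outside a tag) and the function name tokenize_html is made up for illustration. It ignores complications such as comments, script bodies, and a ">" inside a quoted attribute value, so treat it as a starting point rather than a finished tokenizer.

```python
def tokenize_html(text):
    """Split text into ("tag", ...) and ("text", ...) tokens using two states."""
    TEXT, TAG = 0, 1          # the two states of the machine
    state = TEXT
    buf = []                  # characters of the token being built
    tokens = []
    for ch in text:
        if state == TEXT:
            if ch == "<":
                # A tag is starting: emit any pending character data first.
                if buf:
                    tokens.append(("text", "".join(buf)))
                buf = [ch]
                state = TAG
            else:
                buf.append(ch)
        else:  # state == TAG
            buf.append(ch)
            if ch == ">":
                # The tag is complete: emit it and go back to reading text.
                tokens.append(("tag", "".join(buf)))
                buf = []
                state = TEXT
    if buf:
        # Leftover input: trailing text, or a tag that was never closed.
        tokens.append(("text" if state == TEXT else "tag", "".join(buf)))
    return tokens


if __name__ == "__main__":
    for kind, value in tokenize_html("<p>Hello <b>world</b></p>"):
        print(kind, repr(value))
```

Keeping only the tokens labeled "text" gives you the character data; a fuller machine would add states for comments and quoted attribute values.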

You can use compiler construction tools, such as lex (or its faster workalike, flex) along with some C or C++ code.

Tutorials and examples of using lex/flex with C and C++ are available on the web, as are Java counterparts such as JLex and JFlex.

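Python and Perl both have HTML parsers available: Python's standard library includes one, and Perl has the HTML::Parser module. Alternatively, you can use string-processing facilities, such as regular expressions and pattern matching, in Python, Perl, or whatever language you prefer.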
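As one illustration of the parser route, here is a short sketch using the HTMLParser class from Python's standard library (the html.parser module in Python 3). The class name TextExtractor and the helper extract_text are my own names, and skipping the contents of script and style elements is an assumption about what you'd want, not a requirement of the project.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data, skipping the bodies of script and style."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; etc. into characters
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_text(html):
    """Return the character data of an HTML string with whitespace collapsed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()  # flush any text still buffered by the parser
    return " ".join(" ".join(parser.chunks).split())


if __name__ == "__main__":
    page = "<html><body><h1>Title</h1><p>Some &amp; text</p></body></html>"
    print(extract_text(page))
```

The parser calls handle_data for each run of character data, so the tags themselves never reach your own code unless you ask for them.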

It's worth remembering that we don't care much about the HTML tags themselves; it's the character data that we want to process. Not all documents on the Web are strictly correct in terms of HTML syntax, but browsers tend to tolerate them anyway, so your tokenizer should be prepared to do the same.
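Since only the character data matters, even a crude regular-expression pass can get you word tokens. The sketch below is one such pass, with assumptions of my own: the name word_tokens is made up, a "word" is taken to be a run of ASCII letters and digits, and everything is lowercased. It does not decode entities like &amp;amp; or skip script contents, and it simply leaves a tag that is never closed in the text, so it is tolerant of sloppy HTML only in the sense that it never fails.

```python
import re

# Anything from "<" up to the next ">" is treated as a tag and discarded.
TAG_RE = re.compile(r"<[^>]*>")
# Assumption: a word is a run of ASCII letters and digits.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def word_tokens(html):
    """Strip tags with a regex, then return lowercased word tokens."""
    text = TAG_RE.sub(" ", html)  # replace each tag with a space
    return [w.lower() for w in WORD_RE.findall(text)]


if __name__ == "__main__":
    print(word_tokens("<p>Hello, <b>World</b>!</p><p>unclosed para"))
```

Replacing each tag with a space rather than deleting it keeps words on either side of a tag from running together.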

2 comments:

  1. The link in the project description for files.tar.Z is broken, so I've packed all the HTML files in the relevant directory into an archive and made it available. The link may be corrected at some point, but for now my archive should match what's eventually posted.
