CMSC 476/676 Spring 2009: January 2009

As usual, I had a lot more slides to show on Tuesday than there was time for, so I have a lot to discuss today...

You should have read Chapter One of the MRS textbook. (MRS is the acronym made from the authors' surnames.)

The first phase of the term project has been made available on the web site, through the course home page.

I'll need to talk about the syntax of HTML at some length, and relate that to the concept of tokenization. These concepts were probably covered when you took 331. But maybe you haven't taken 331 yet.

Anyway, there are lots of ways to tokenize an HTML file. You can write custom code in the language of your choice, of course. If you go this route, you might consider a finite state machine.
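To make the finite state machine idea concrete, here is a minimal sketch in Python. It is only an illustration, not the required approach for the project: the function name `tokenize_fsm`, the two states, and the choice to lowercase tokens and treat any non-alphanumeric character as a separator are all my assumptions. It also ignores comments, entities, script/style content, and a `>` inside an attribute value.

```python
def tokenize_fsm(html):
    """Return lowercase word tokens from character data, skipping everything inside <...>."""
    TEXT, TAG = 0, 1          # the two states of the machine (hypothetical names)
    state = TEXT
    tokens = []
    word = []                 # characters of the token currently being built

    def flush():
        # Emit the pending token, if any.
        if word:
            tokens.append("".join(word).lower())
            word.clear()

    for ch in html:
        if state == TEXT:
            if ch == "<":             # a tag starts: end the current token
                flush()
                state = TAG
            elif ch.isalnum():        # part of a word
                word.append(ch)
            else:                     # whitespace or punctuation separates tokens
                flush()
        else:                         # state == TAG: skip until the tag closes
            if ch == ">":
                state = TEXT
    flush()                           # emit a token left over at end of input
    return tokens
```

For example, `tokenize_fsm("<p>Hello, <b>World</b>!</p>")` walks in and out of the TAG state four times and keeps only the two words of character data.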

You can use compiler construction tools, such as lex (or its faster workalike, flex) along with some C or C++ code.

Tutorials and examples of using lex/flex with Java, C and C++ are available on the web.

Python and Perl (I think) have their own HTML parsers. Or you can use string-processing features, such as regular expressions and pattern matching, in Python or Perl or whatever you like.
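As a rough sketch of the "use an existing parser" route, the following uses `HTMLParser` from Python's standard `html.parser` module to collect character data, then a regular expression to split it into tokens. The names `TextCollector` and `tokenize_html` are made up for this example, and so is the token definition (runs of ASCII letters and digits, lowercased). Note that `handle_data` also receives the contents of script and style elements, so a real tokenizer may want to filter those out.

```python
import re
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the character data of a document and ignores the tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called by the parser for each run of text between tags.
        self.chunks.append(data)


def tokenize_html(html):
    parser = TextCollector()
    parser.feed(html)
    parser.close()                      # flush any text still buffered
    text = " ".join(parser.chunks)
    # Assumed token definition: runs of ASCII letters and digits, lowercased.
    return re.findall(r"[a-z0-9]+", text.lower())
```

Calling `tokenize_html("<p>Hello <b>World</b><p>unclosed tag")` still returns the words even though the markup is never closed, since this parser does not insist on well-formed input.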

It's worth remembering that we don't care much about the HTML tags themselves; it's the character data that we want to process. Not all documents on the Web are strictly correct HTML, but browsers tend to tolerate them anyway, so your tokenizer probably should too.


Friday, January 30, 2009

Thursday, January 29, 2009

remarks for Thursday 1/29

Welcome to the CMSC 476/676 Blog!
