Friday, January 30, 2009

Searching of blogs is an important topic in IR right now. For a survey that you may find interesting, see
http://feedproxy.google.com/~r/readwriteweb/~3/qUIz5VV1jyw/the_state_of_blog_search_engines.php

Thursday, January 29, 2009

Remarks for Thursday 1/29

As usual, I had many more slides to show on Tuesday than there was time for, so I have a lot to discuss today...

You should have read Chapter One of the MRS textbook. (MRS is the acronym made from the authors' surnames.)

The first phase of the term project has been made available on the web site, through the course home page.

I'll need to talk about the syntax of HTML at some length, and relate that to the concept of tokenization. These concepts were probably covered when you took 331. But maybe you haven't taken 331 yet.

Anyway, there are lots of ways to tokenize an HTML file. You can write custom code in the language of your choice, of course. If you go this route, you might consider a finite state machine.
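To make the finite state machine idea concrete, here is a minimal sketch in Python. It is only an illustration, not the required approach for the project: the function name tokenize_html and the choice to lowercase tokens and split on non-alphanumeric characters are my own assumptions. It has just two states (inside a tag, outside a tag) and ignores complications such as comments, script blocks, character entities, and a ">" inside an attribute value.

```python
def tokenize_html(text):
    """Return lowercase alphanumeric tokens found outside of HTML tags.

    A two-state machine: IN_TEXT collects characters into tokens,
    IN_TAG discards everything until the closing '>'.
    """
    IN_TEXT, IN_TAG = 0, 1
    state = IN_TEXT
    tokens = []
    current = []  # characters of the token being built

    def flush():
        # Emit the token built so far, if any, and start a new one.
        if current:
            tokens.append("".join(current))
            current.clear()

    for ch in text:
        if state == IN_TEXT:
            if ch == "<":
                flush()
                state = IN_TAG
            elif ch.isalnum():
                current.append(ch.lower())
            else:
                # Whitespace or punctuation ends the current token.
                flush()
        else:  # state == IN_TAG
            if ch == ">":
                state = IN_TEXT

    flush()  # emit a trailing token at end of input
    return tokens


if __name__ == "__main__":
    print(tokenize_html("<p>Hello, <b>World</b>!</p>"))
```

Adding more states (for comments, quoted attribute values, and so on) is how you would make this sturdier.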

You can use compiler construction tools, such as lex (or its faster workalike, flex) along with some C or C++ code.

Tutorials and examples of using lex/flex with C and C++, and of similar tools for Java such as JFlex, are available on the web.

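Python and Perl each have HTML parsers available (for example, the HTML parser in Python's standard library and the HTML::Parser module for Perl). Or you can use string processing, such as regular expressions and pattern matching, in Python, Perl, or whatever language you prefer.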

It's worth remembering that we don't care much about the HTML tags themselves; it's the character data that we want to process. Not all documents on the Web are strictly correct in terms of HTML syntax, but browsers tend to tolerate them anyway.
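If you'd rather lean on a library, here is a rough sketch using the HTML parser from Python's standard library (html.parser in Python 3) to keep only the character data, followed by a regular expression to split it into tokens. The class name TextExtractor, the function name html_tokens, and the decision to skip the contents of script and style elements are my own assumptions for illustration. This parser is fairly forgiving about sloppy markup such as unclosed tags.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect character data, ignoring tags and script/style contents."""

    def __init__(self):
        super().__init__(convert_charrefs=True)  # turn &amp; into & etc.
        self.chunks = []
        self._skip_depth = 0  # > 0 while inside <script> or <style>

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Called with the text found between tags.
        if self._skip_depth == 0:
            self.chunks.append(data)


def html_tokens(html):
    """Return lowercase alphanumeric tokens from the text of an HTML string."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    text = " ".join(parser.chunks)
    return re.findall(r"[a-z0-9]+", text.lower())


if __name__ == "__main__":
    page = "<body><p>Blog search &amp; IR<p>unclosed tag</body>"
    print(html_tokens(page))
```

The regular expression here only recognizes ASCII letters and digits; what should count as a token is a design decision you'll have to make for yourself.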

Welcome to the CMSC 476/676 Blog!

This is the blog for CMSC 476/676, offered Spring 2009 at UMBC.

If you send me an email, and if you're in the class, I'll grant permission to post. As I understand it, anybody can read this blog. So keep that in mind! You might be writing for posterity!