Tuesday, May 12, 2009

Looking ahead

IR is a lot more vibrant now than it was ten years ago, and it was more vibrant then than ten years before that.

Obviously, much of this is due to the web, but now social networking services, e.g. Facebook and Twitter, are becoming more important.

How many of you use Twitter? Why or why not?

What will come after Twitter?

Areas come in and out of fashion. CLIR (cross-language information retrieval) was hot for a while, and is now in decline, but will likely rise again. Virtual machines were hot, then cold for a long time, then hot, and are now much hotter.

Who knows what's next? See http://www.smalltalk.org/alankay.html

Thursday, April 16, 2009

questions for Mike @ google

Is Google hiring?
What about interns?
What is Mike's 20% project about?
Is there room for improvement in the PageRank algorithm?
Is Google using clustering, e.g. in QP or results presentation?
What is Google's role w.r.t. the integrity of the Internet?
Will G apply clustering to Book Search?
What areas of research does Google see as most promising, especially within AI?
How much NLP is used in QP?
Any comments on DDOS?
What is Mike's job, anyway?
Any other interesting 20% projects he can discuss?
Novel applications of Map-Reduce?
Does Google have a favorite security model, e.g. Chinese Wall or Lipner's?
When will Google Music search be available?
How many Ferraris in the Google parking lot?
Is the food as good as before?

Tuesday, March 24, 2009

476/676 Writing Project

It is sometimes desirable to write comprehensive surveys of a given area of science, including a thorough review of the literature, and a list of open problems.

It is also sometimes useful to write a short summary of a given area, based on a few of the most recent or most important papers in the literature. Then one can ask questions like, what is the most important issue, and what is the most promising approach to addressing that issue? Such documents are called by different names, e.g. white papers.

When one approaches an agency such as the National Science Foundation (NSF) for money to do research, the first step is to write a white paper. The structure is in four parts, based on these questions: what's the problem, what have others done to solve it, what is the most promising approach at this point, and what would it take to pursue that approach?

The 476/676 writing project this semester is to choose a topic within (or related to) information retrieval and write a short white paper on that topic. The paper may be no more than five pages in length, single-spaced, with no more than ten references and at most three small figures.

Possible topics, not an exhaustive list: building large collections for IR evaluation; distributed IR, especially collection selection, or results fusion; cross-language IR; variations on the vector space model, such as GVSM; variations on latent semantic analysis; searching specialized corpora, such as music, patents, images, movies, etc.; searching the semantic web; specialized computer architectures for IR; clustering algorithms, especially variations on k-means; information filtering, especially spam detection; text summarization; information extraction; IR systems for the disabled; recommender systems; adversarial IR; text categorization, especially feature selection; use of linear algebra in IR, e.g. scalable matrix decomposition methods.

Many, but not all, of these are mentioned in the textbook. We won't have time to cover any of them in much depth in class, although we will talk about LSA and clustering. But there is lots of material available on all of these topics!

Important dates: Choose a topic, and send an email to me (and cc Don Dimitroff) describing your topic. I may suggest that the topic be broadened or narrowed. Do this by April 7.
The paper will be due in class on May 12.

Thursday, March 5, 2009

Preparing for the midterm exam

There are several exercises given in the textbook. Some are more useful than others as models for exam questions.

For the first six chapters of the textbook, I can suggest exercises as follows:

Exercises 1.1, 1.2, 1.3 (all easy for you now)
1.7, 1.8, 2.1, 2.5, 2.6, 2.8, 2.14, 3.7, 3.8, 4.2, 4.5, 5.1, 5.18, 6.8-6.17, 6.19-6.23

You can ignore 1.4, 1.5, 1.6, 1.9-1.13, 2.2-2.4, 2.7, 2.9-2.13, 3.1-3.6, 3.9-3.15, 4.1, 4.3, 4.4, 4.6-4.12, 5.2-5.17, 6.1-6.7, 6.18, 6.24

Tuesday, February 10, 2009

notes before class 2/10/09

Last week we did chapters 1 and 2 of the textbook.

The Boolean retrieval model is not used in practice as much as in years past, but as an example of the issues that arise in IR systems in general, it's helpful I think. The concepts related to managing the "term space", i.e. the set of terms used to represent the documents in a corpus, are still quite relevant.
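To make the Boolean model concrete, here is a minimal sketch of the idea in Python: build an inverted index mapping each term in the term space to the set of documents containing it, then answer an AND query by intersecting posting sets. The document texts are made up for illustration.

```python
# Toy corpus: document ID -> text. These documents are invented examples.
docs = {
    1: "new home sales top forecasts",
    2: "home sales rise in july",
    3: "increase in home sales in july",
    4: "july new home sales rise",
}

# Build the inverted index: each term maps to the set of docs containing it.
index = {}
for doc_id, text in docs.items():
    for term in text.split():
        index.setdefault(term, set()).add(doc_id)

def boolean_and(*terms):
    """Return the sorted IDs of documents containing every query term."""
    postings = [index.get(t, set()) for t in terms]
    result = set.intersection(*postings) if postings else set()
    return sorted(result)

print(boolean_and("home", "sales", "july"))  # -> [2, 3, 4]
```

Real systems store postings as sorted lists and merge them, rather than using hash sets, but the set-intersection view captures the model.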

N-grams have some advantages and disadvantages over words, when it comes to terms used to represent a document. I'll explain this in class, in the future if not today.
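As a preview, here is a small illustration of character n-grams as index terms. N-grams require no word segmentation and tolerate spelling variation, at the cost of a larger and noisier term space; the example words are mine.

```python
def char_ngrams(text, n=3):
    """Return the list of overlapping character n-grams of the text."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]

print(char_ngrams("retrieval"))
# -> ['ret', 'etr', 'tri', 'rie', 'iev', 'eva', 'val']

# A misspelling still shares several n-grams with the correct form,
# whereas as whole words the two terms would not match at all.
a = set(char_ngrams("retrieval"))
b = set(char_ngrams("retreival"))
print(sorted(a & b))  # -> ['etr', 'ret', 'val']
```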

The first phase of the project is due today. Some students have offered to attach their programs to the submission, allowing me or Mr. Don to run the programs. This is acceptable but not necessary.

I'll be releasing phase two today, I think.

Friday, January 30, 2009

Searching of blogs is an important topic in IR right now. For a survey that you may find interesting, see
http://feedproxy.google.com/~r/readwriteweb/~3/qUIz5VV1jyw/the_state_of_blog_search_engines.php

Thursday, January 29, 2009

remarks for Thursday 1/29

So as usual, I had a lot more slides to show on Tuesday than there was time for. So I have a lot to discuss today...

You should have read Chapter One of the MRS textbook. (MRS is the acronym made from the authors' surnames.)

The first phase of the term project has been made available on the web site, through the course home page.

I'll need to talk about the syntax of HTML at some length, and relate that to the concept of tokenization. These concepts were probably covered when you took 331. But maybe you haven't taken 331 yet.

Anyway, there are lots of ways to tokenize an HTML file. You can write custom code in the language of your choice, of course. If you go this route, you might consider a finite state machine.

You can use compiler construction tools, such as lex (or its faster workalike, flex) along with some C or C++ code.

Tutorials and examples of using lex/flex with Java, C and C++ are available on the web.

Python and Perl (I think) have their own HTML parsers. Or you can use string-processing facilities, such as regular expressions and pattern matching, in Python or Perl or whatever.

It's worth remembering that we don't care much about the HTML tags themselves - it's the character data that we want to process. Not all documents on the Web are strictly correct in terms of HTML syntax, but browsers tend to tolerate them anyway.
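To sketch that point, here is one way to pull just the character data out of an HTML document using Python's standard-library parser (called html.parser in Python 3; the older module name was HTMLParser). Like a browser, it tolerates imperfect markup, and the tags themselves are simply discarded. The sample HTML is made up.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collect the character data of an HTML document, ignoring all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for the character data between tags; keep non-blank runs.
        if data.strip():
            self.chunks.append(data.strip())

    def text(self):
        return " ".join(self.chunks)

extractor = TextExtractor()
# Note the unclosed <p> tag: the parser copes, as browsers do.
extractor.feed("<html><body><h1>IR notes</h1><p>Tokenize <b>this</b> text.")
print(extractor.text())  # -> IR notes Tokenize this text.
```

The string returned by text() is then ready to be split into tokens by whatever scheme you choose.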