Task Information for Language

Task Author: Gordon Cormack (CAN)

This kind of problem is new to the IOI. Its purpose is to draw attention to the field of information retrieval. The problem is discussed in detail in the book Information Retrieval: Implementing and Evaluating Search Engines by S. Büttcher, C.L.A. Clarke, and G.V. Cormack (MIT Press, to appear soon). See especially Chapter 10 on Categorization and Filtering.

One important observation is that excerpts from the same language version of Wikipedia share some characteristics in a statistical sense. Because many random excerpts are offered, the variability between excerpts from the same language plays a negligible role. It has been confirmed that the provided test input and the official grader input resemble each other statistically in a highly predictable way.

Note that because the language codes and symbol codes are randomly re-coded, there is no opportunity to hard-code any specific (personal) language knowledge into a solution.

Many approaches are possible. Rocchio's method, which was informally described in the task description, suffices to solve Subtask 1.
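As an illustration only, a Rocchio-style (nearest-centroid) classifier over symbol frequencies might be sketched as follows. The class name RocchioSketch, its guess/learn methods, and the choice of cosine similarity are assumptions made for this sketch; they are not the interface of the official grader, nor necessarily the exact variant described in the task description.

```python
from collections import Counter, defaultdict
import math


def cosine(a, b):
    """Cosine similarity between two sparse count vectors (dict-like)."""
    dot = sum(v * b.get(k, 0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


class RocchioSketch:
    """Hypothetical online classifier: one summed count vector per language."""

    def __init__(self):
        # language code -> summed symbol counts of all excerpts seen so far.
        # The sum points in the same direction as the centroid, which is
        # all that cosine similarity needs.
        self.totals = defaultdict(Counter)

    def guess(self, excerpt):
        """Return the most similar known language, or None if none is known."""
        if not self.totals:
            return None
        counts = Counter(excerpt)
        return max(self.totals, key=lambda lang: cosine(counts, self.totals[lang]))

    def learn(self, excerpt, language):
        """Add the excerpt's symbol counts to the profile of its true language."""
        self.totals[language].update(excerpt)
```

In use, one would call guess on each excerpt (a sequence of symbol codes), then call learn with the correct language once it is revealed, so that the per-language profiles improve as more excerpts are seen.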

For Subtask 2, one needs to do more than simply look at symbol frequencies. Collecting statistics on bigrams (pairs of neighboring symbols) and trigrams (three consecutive symbols) will yield higher accuracy.
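A minimal sketch of collecting such statistics is given here, assuming an excerpt is available as a list of integer symbol codes. The helper names ngram_counts and features are hypothetical, introduced only for this illustration.

```python
from collections import Counter


def ngram_counts(symbols, n):
    """Count every run of n consecutive symbols, keyed by a tuple of length n."""
    # zip over n shifted views of the sequence; it stops at the shortest one,
    # so sequences shorter than n simply yield no n-grams.
    return Counter(zip(*(symbols[i:] for i in range(n))))


def features(symbols, max_n=3):
    """Combine unigram, bigram, and trigram counts into one sparse vector."""
    combined = Counter()
    for n in range(1, max_n + 1):
        # Keys of different lengths never collide, so the counts stay separate.
        combined.update(ngram_counts(symbols, n))
    return combined
```

The resulting count vectors could replace plain symbol frequencies in whichever classifier is used, at the cost of more memory per language.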