Description

This course covers the techniques and tools used to automatically crawl, parse, index, store, rank, and search the Web for information. By the end of the course, students will have implemented a complete (but simplified) Web search engine. Specific topics will include:

Basic information

Time & Location
Informatics East (I2) 130
Tuesday 5:45pm-7:00pm
First meeting: Aug. 23rd, 2016 (Tuesday)
Labs
I 109
Thursday 11:15am - 12:30pm
Thursday 5:45pm - 7:00pm
Announcements
All students have been added to the course mailing list (i427_fall16-l@indiana.edu). If for any reason you are not on the list, you must join it. Course announcements are sent to the I427 mailing list, so you need to read its emails to receive them. If you have any questions or comments during the course, please use the mailing list.
Instructors
You may send any questions to the instructor mailing list (i427_instructors-l@indiana.edu).

Azadeh Nematzadeh (azadnema@indiana.edu)
Office: Informatics East Room 322A
Office hours: Tuesday 2:00pm - 3:00pm
(you can also schedule a meeting if you cannot attend during this time)

AIs
Pik-Mai Hui (huip@umail.iu.edu)
Office: Informatics East Room 400
Office hours: Thursday 2:00pm - 3:00pm

Thomas Parmer (tjparmer@indiana.edu)
Office: Informatics East Room 400
Office hours: Wednesday 4:00pm - 5:00pm

Textbook
Introduction to Information Retrieval
Other resources
Think Python
Dive into Python
Effective Python
Test Driven Development: By Example
Mining of Massive Datasets
Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
Prerequisites
Either INFO I211 and I308, or CS C211 and C212, or some comparable introductory programming course sequence. Please note that this course will require a significant amount of programming; depending on your programming background, you may need to invest a significant amount of out-of-class time in learning and practicing programming skills.

Schedule

This schedule will be updated throughout the semester.
Week | Date | Class Topic | Readings and Resources | Due Date
1 Aug, 23 Introduction Complete the survey by 8/24 11pm
1 Aug, 25 Setting up Assignment 1
(due: 8/30, 11pm)
2 Aug, 30 Python
2 Sept, 1 HackerRank exercise on Data Structures lab 2
(due: 9/8, 1am)
3 Sept, 6 Parsing
  • Chapter 6 of Effective Python
  • Chapter 11 of Think Python
  • Chapters 2, 6, and 15 of Introduction to Information Retrieval
  • today's slides
Assignment 2
(due: 9/29, 11pm)
3 Sept, 8 Text Processing and String Manipulation lab 3
(due: 9/15, 1am)
4 Sept, 13 OOP
  • Chapters 3 and 4 of Effective Python
  • Chapter 5 of Dive into Python
  • today's slides
4 Sept, 15 Assignment 2
  • Working on Assignment 2
attendance is mandatory
5 Sept, 20 Crawling
5 Sept, 22 Assignment 2
  • Working on Assignment 2
attendance is mandatory
6 Sept, 27 Crawling
6 Sept, 29 Graph Traversal lab 6
(due: 10/6, 1am)

Assignment 3
(due: 10/19, 11pm)
7 Oct, 4 Assignment 2 assessment attendance is mandatory
7 Oct, 6 Web Crawling
  • Working on Assignment 3
attendance is mandatory
8 Oct, 11 Indexing
  • Chapter 20 of the IR book (Introduction to Information Retrieval)
  • Chapter 5
  • today's slides
8 Oct, 13 HTML Parser
  • HTML parser
  • HTML parser with lxml
  • example
9 Oct, 18 IR
9 Oct, 20 MapReduce Assignment 4
(due: 11/2, 11pm)
10 Oct, 25 Unit test
10 Oct, 27 Unittest attendance is mandatory
11 Nov, 1 PageRank
11 Nov, 3 How to write software
  • Assignment 5
    (due: 11/16, 11pm)
  • attendance is mandatory
12 Nov, 8 PageRank
12 Nov, 10 Demo: Inverted Index
  • Live coding: we demo how to design, implement, and test the Inverted Index
  • attendance is mandatory
13 Nov, 15 PageRank
13 Nov, 17 Web programming Final project
(due: Dec 10, 11:59pm)
14 Nov, 22 Thanksgiving
14 Nov, 24 Thanksgiving
15 Nov, 29 SEO and Ethics
15 Dec, 1 Retrieval Demo: Retrieval
16 Dec, 6 Dead week
16 Dec, 8 Dead week
17 Dec, 13 Final project assessment

Policies

Grading policy

Lab assignments will include programming exercises on HackerRank. These are individual exercises.

Pop quizzes will consist of multiple-choice and short-answer questions that review course material.

Assignments will require you to write code in Python.

The final project will consist of building a complete web search engine and writing an accompanying report.

* You will get an F grade if you fail to submit the final project.
* You may work with a partner on assignments and the final project.
* GitHub activity will affect your grade, especially if you are working in a team. Remember to check in!

Acknowledgment

The material for this course builds on a past offering of the course by Prof. David Crandall. I would like to thank him for sharing his materials and for his guidance.