This is a course about the techniques and tools that are used to automatically crawl, parse, index, store, rank, and search the Web for information.
By the end of this course, students will have implemented a complete (but simplified) Web search engine. Specific topics will include:
- Python for information retrieval
- Vector space model and similarity measures
- Web crawling
- Indexing
- Retrieval and Ranking
- Link analysis and PageRank
- Search and query interfaces
- Search engine optimization
- Ethical and legal implications of search
- Time & Location
- Informatics East (I2) 130
Tuesday 5:45pm-7:00pm
First meeting:
Aug. 23rd, 2016 (Tuesday)
- Labs
- I 109
Thursday 11:15am - 12:30pm
Thursday 5:45pm - 7:00pm
- Announcements
- All students have been added to the course mailing list
(i427_fall16-l@indiana.edu). If for any reason you are not on the
mailing list, you must join it.
Course announcements are sent to the I427 mailing list, so you need to read its emails.
If you have any questions or comments during the course, please use the mailing list.
- Instructors
- You may send any question to the instructor mailing list
(i427_instructors-l@indiana.edu).
Azadeh Nematzadeh
(azadnema@indiana.edu)
Office: Informatics East Room 322A
Office hours: Tuesday 2:00pm - 3:00pm
(you can also schedule a meeting if you cannot attend during this time)
- AI
- Pik-Mai Hui (huip@umail.iu.edu)
Office: Informatics East Room 400
Office hours: Thursday 2:00pm - 3:00pm
Thomas Parmer (tjparmer@indiana.edu)
Office: Informatics East Room 400
Office hours: Wednesday 4:00pm - 5:00pm
- Textbook
- Introduction to Information Retrieval
- Other resources
- Think Python
- Dive into Python
- Effective Python
- Test Driven Development: By Example
- Mining of Massive Datasets
- Mining the Social Web: Data Mining Facebook, Twitter, LinkedIn, Google+, GitHub, and More
- Prerequisites
- Either INFO I211 and I308, or CS C211 and C212, or some comparable introductory programming course
sequence. Please note that this course will require a significant amount of programming; depending on your programming background, you
may need to invest a significant amount of out-of-class time in learning and practicing programming skills.
This schedule will be updated throughout the semester.
Week | Date | Class Topic | Readings and Resources | Due Date |
1 | Aug 23 | Introduction | | Complete the survey by 8/24, 11pm |
1 | Aug 25 | Setting up | | Assignment 1 (due: 8/30, 11pm) |
2 | Aug 30 | Python | | |
2 | Sept 1 | HackerRank exercise on Data Structures | | Lab 2 (due: 9/8, 1am) |
3 | Sept 6 | Parsing | Chapter 6 of Effective Python; Chapter 11 of Think Python; Chapters 2, 6, and 15 of Introduction to Information Retrieval; today's slides | Assignment 2 (due: 9/29, 11pm) |
3 | Sept 8 | Text Processing and String Manipulation | | Lab 3 (due: 9/15, 1am) |
4 | Sept 13 | OOP | Chapters 3 and 4 of Effective Python; Chapter 5 of Dive into Python; today's slides | |
4 | Sept 15 | Assignment 2 | Working on Assignment 2 | Attendance is mandatory |
5 | Sept 20 | Crawling | | |
5 | Sept 22 | Assignment 2 | Working on Assignment 2 | Attendance is mandatory |
6 | Sept 27 | Crawling | | |
6 | Sept 29 | Graph Traversal | | Lab 6 (due: 10/6, 1am); Assignment 3 (due: 10/19, 11pm) |
7 | Oct 4 | Assignment 2 assessment | | Attendance is mandatory |
7 | Oct 6 | Web Crawling | | Attendance is mandatory |
8 | Oct 11 | Indexing | Chapter 20 of IR book; Chapter 5; today's slides | |
8 | Oct 13 | HTML Parser | HTML parser; HTML parser with lxml; example | |
9 | Oct 18 | IR | | |
9 | Oct 20 | MapReduce | | Assignment 4 (due: 11/2, 11pm) |
10 | Oct 25 | Unit test | | |
10 | Oct 27 | Unittest | | Attendance is mandatory |
11 | Nov 1 | PageRank | | |
11 | Nov 3 | How to write software | | Assignment 5 (due: 11/16, 11pm); attendance is mandatory |
12 | Nov 8 | PageRank | | |
12 | Nov 10 | Demo: Inverted Index | Live coding: we demo how to design, implement, and test the Inverted Index | |
13 | Nov 15 | PageRank | | |
13 | Nov 17 | Web programming | | Final project (due: Dec 10, 11:59pm) |
14 | Nov 22 | Thanksgiving | | |
14 | Nov 24 | Thanksgiving | | |
15 | Nov 29 | SEO and Ethics | | |
15 | Dec 1 | Retrieval | Demo: Retrieval | |
16 | Dec 6 | Dead week | | |
16 | Dec 8 | Dead week | | |
17 | Dec 13 | Final project assessment | | |
- Students must join the mailing list and read all of the emails and announcements carefully.
- Students are responsible for their assignment submissions.
Students must ensure a timely and complete submission through IU Canvas.
- The programming assignments and final project will be submitted through
GitHub.iu or HackerRank.
- Assignments will be accepted up to 48 hours after the due date, but with a 10% late penalty;
assignments that are received more than 48 hours late will not be accepted and will receive a failing grade.
- Please make arrangements with the instructor if you have a disability that requires
specific attention.
- Please contact the instructor by the end of the second week of the semester if you need an
accommodation for religious holidays. Instructors are expected to give students the opportunity to do
appropriate make-up work that is intrinsically no more difficult than the original exam or assignment. (Source:
Indiana University Academic Guide).
- We take academic integrity very seriously. You are required to abide by the Indiana University policy
on academic integrity, as described in the IU Code of Student Rights,
Responsibilities, and Conduct and the
Computer Science Statement on Academic Integrity.
In short: you must credit all sources, you must write your own code, and you must not cheat or plagiarize.
Grading policy
- Lab assignments 10%
- Class pop quizzes 20%
- Assignments 40%
- Final project 30%
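As a rough illustration of how the weights above combine, here is a minimal Python sketch. It assumes each category has already been reduced to a single percentage between 0 and 100; the syllabus does not say how scores within a category are aggregated, so the function name, the dictionary keys, and the sample scores are all hypothetical. It also does not model late penalties or the rule about a missing final project.

```python
# Category weights from the grading policy, in percent of the final grade.
WEIGHTS = {
    "labs": 10,
    "quizzes": 20,
    "assignments": 40,
    "final_project": 30,
}


def course_percentage(scores):
    """Combine per-category percentages (0-100) into one weighted percentage.

    `scores` maps each category name in WEIGHTS to a percentage.
    This is an illustrative helper, not the official grade calculation.
    """
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing categories: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS) / 100


# Hypothetical student: strong labs and project, weaker quizzes.
example = {"labs": 90, "quizzes": 80, "assignments": 85, "final_project": 95}
print(course_percentage(example))  # 87.5
```

In this made-up example the assignments contribute 34 of the 87.5 points, which shows why they matter most for the final grade.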
Lab assignments will include programming exercises through HackerRank. These are individual exercises.
Pop quizzes will consist of multiple-choice and short answer
questions and they will review course material.
Assignments will require you to write code in Python.
The final project consists of building a complete web search engine and writing an accompanying report.
*You will get an F grade if you fail to submit the final project.
*You may work with a partner on assignments and the final project.
*GitHub activities will affect your grade, especially if you are working in a team. Remember to check in!
Acknowledgment
This course's material builds on a past offering of the course by Prof. David Crandall. I would like to thank
him for sharing his materials as well as his guidance.