General Course Info


  • Instructor:
    Roberto Corizzo [rcorizzo@american.edu]
  • First Class: Aug 26
  • Location: DMTI 116
  • Office Hours:
    Schedule a time to meet with me through Acuity

Course abstract

Nowadays we are witnessing the continuous generation of a huge volume of data from heterogeneous sources. Managing, processing, and analyzing such data becomes challenging and requires the adoption of high-performance computing frameworks and techniques. The course discusses the performance bottlenecks in traditional data processing and analytical tools and techniques and presents the opportunities to design scalable solutions leveraging distributed cluster environment architectures. Emphasis is put on the Hadoop and Spark frameworks, with practical examples of big data processing and analytical tasks in real-world applications.

AU Core Quantitative Literacy II (Q2) Outcomes:

  1. Translate real-world questions or intellectual inquiries into quantitative frameworks.
  2. Select and apply appropriate quantitative methods or reasoning.
  3. Draw appropriate insights from the application of a quantitative framework.
  4. Explain quantitative reasoning and insights using appropriate forms of representation so that others could replicate the findings.

PySpark and Databricks

In this course, we will be using the Python programming interface to the Apache Spark framework (PySpark), in combination with Databricks. You will be able to write and execute PySpark code directly in your browser, without worrying about standalone configuration, and leveraging free access to Databricks' powerful cloud infrastructure. This structure also facilitates collaborative coding.

Our course has joined the Databricks University Alliance, an active community of professors and educators who collaboratively share ideas to improve the teaching experience and provide students with the most recent and relevant developments in data science tools and concepts adopted in industry. To start using Databricks, you can set up a free personal account (Community Edition). This option allows you to use Databricks on Amazon AWS for free. Instructions to sign up will be provided in the course.

Course Schedule

Date Topic Module Deadlines
Week 1
Tue., Aug. 26 Introduction to HPC S1
Fri., Aug. 29 Big Data Analytics / NoSQL S2 - S3
Week 2
Tue., Sep. 2 Hadoop I S4 Install VM
Fri., Sep. 5 Hadoop II S4
Week 3
Tue., Sep. 9 Spark: Intro, RDDs & Databricks Platform S5 Create Databricks Account
Fri., Sep. 12 Spark: Dataframes S5
Week 4
Tue., Sep. 16 Spark: Transformations S6
Fri., Sep. 19 Spark: Internals S7
Week 5
Tue., Sep. 23 Midterm Exam I Pool of Papers Release
Fri., Sep. 26 Spark: Structured Streaming / Delta Lakes S8
Week 6
Tue., Sep. 30 Spark: ML and MLlib / Linear Regression S9
Fri., Oct. 3 Spark: MLflow / Decision Trees ** Family Weekend ** S10
Week 7
Tue., Oct. 7 Spark: HyperOpt / Random Forest S11
Fri., Oct. 10 Fall Break (no class)
Week 8
Tue., Oct. 14 Spark: HyperOpt / AutoML / Gradient Boosted Trees S12
Fri., Oct. 17 Spark: Distributed k-Means / Pandas UDF S12
Week 9
Tue., Oct. 21 Spark: Logistic Regression / Collaborative Filtering S12
Fri., Oct. 24 Time Series Forecasting / Graph Analysis: GraphX / Case Study / GraphFrames S13
Week 10
Tue., Oct. 28 Spark: Deep Learning I S14
Fri., Oct. 31 DL II: Hyperopt / Transfer Learning / One-Class S15 Paper critiques submission / Project Assignment
Week 11
Tue., Nov. 4 DL III: Horovod S16
Fri., Nov. 7 DL III: Model Interpretability S16
Week 12
Tue., Nov. 11 Spark: NLP / Invited Guest S16
Fri., Nov. 14 Spark: ML Deployment / ML in production S17 / S18 Project: Data Exploration / Pre-processing
Week 13
Tue., Nov. 18 Midterm Exam II
Fri., Nov. 21 Spark: Performance Optimization I / Invited Guest S19 Project: Modeling
Week 14
Tue., Nov. 25 Spark: Performance Optimization II (Online) S19
Fri., Nov. 28 Thanksgiving Week (no class)
Week 15
Tue., Dec. 2 Paper Presentations
Fri., Dec. 5 Paper Presentations Project submission

Syllabus

Grading

CSC-483


Component Weight
Midterm Exams (2) 50% (2 x 25%)
Research Papers: 2 critiques + 1 presentation 20%
Final Project 30%


CSC-683


Component Weight
Midterm Exams (2) 40% (2 x 20%)
Research Papers: 3 critiques + 1 presentation 20%
Final Project 40%
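
As a rough illustration of how the weights above combine, the following Python sketch computes a weighted course score for either section. The weights are copied from the two tables; the function name, the component keys, and the example scores are hypothetical and are not part of any official grading tool.

```python
# Component weights copied from the grading tables above (fractions of 1.0).
# The dictionary keys are illustrative names, not official identifiers.
WEIGHTS = {
    "CSC-483": {"midterm_1": 0.25, "midterm_2": 0.25, "papers": 0.20, "project": 0.30},
    "CSC-683": {"midterm_1": 0.20, "midterm_2": 0.20, "papers": 0.20, "project": 0.40},
}


def course_score(section, scores):
    """Return the weighted course score (0-100) from component scores (0-100)."""
    weights = WEIGHTS[section]
    if set(scores) != set(weights):
        raise ValueError("scores must contain exactly: " + ", ".join(sorted(weights)))
    return sum(weights[name] * scores[name] for name in weights)


# Hypothetical component scores, for illustration only.
example = {"midterm_1": 80, "midterm_2": 90, "papers": 85, "project": 95}
print(round(course_score("CSC-483", example), 2))
print(round(course_score("CSC-683", example), 2))
```

The same component scores can yield different totals in the two sections because the Final Project carries more weight in CSC-683.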

Attendance

Students are encouraged to attend all lectures. Prolonged absences must be discussed with the instructor. If you cannot attend lectures regularly because of work or other obligations during remote learning, please reach out to the instructor in advance.


Exams

Exams cover the material from the lectures, projects, and reading. While not necessarily cumulative, each exam will require understanding many of the concepts covered in the preceding exams. Exams consist of multiple choice, short answer, and long answer questions. Each exam, except the final, is weighted equally.

For the Final Project, students will propose their own topic in consultation with the instructor.


Late Submissions

A penalty of 5% per day will be levied on late submissions. Extensions on homework, lab, or project deadlines are not granted unless you have an extremely compelling reason, such as observance of a religious holiday (in which case you need to let me know in advance).
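
To make the arithmetic concrete, here is a minimal Python sketch of one possible reading of the penalty. It assumes that 5 percentage points are deducted from the earned score for each full day late and that the score cannot drop below zero; the syllabus does not spell out these details, so treat the function and its parameters as hypothetical and ask the instructor if in doubt.

```python
def apply_late_penalty(score, days_late, penalty_per_day=5.0):
    """Return the score (0-100) after deducting a fixed penalty per day late.

    Assumption: the penalty is in percentage points of the maximum score,
    and the result is floored at zero.
    """
    if days_late < 0:
        raise ValueError("days_late must be non-negative")
    return max(0.0, score - penalty_per_day * days_late)


# Hypothetical example: a submission graded 90 that arrives two days late.
print(apply_late_penalty(90, 2))
```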


Letter Grades

Range Letter
>=93 A
>=90 A-
>=87 B+
>=83 B
>=80 B-
>=77 C+
>=73 C
>=70 C-
>=60 D
<60 F
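
The table above can be read as a top-down lookup: the first cutoff that the score meets determines the letter. A short Python sketch of that reading follows. The cutoffs are copied from the table; the function name is hypothetical, and the sketch assumes no rounding is applied before the comparison, which the syllabus does not specify.

```python
# Cutoffs copied from the Letter Grades table, highest first.
GRADE_CUTOFFS = [
    (93, "A"), (90, "A-"),
    (87, "B+"), (83, "B"), (80, "B-"),
    (77, "C+"), (73, "C"), (70, "C-"),
    (60, "D"),
]


def letter_grade(score):
    """Map a numeric course score to a letter using the first cutoff it meets."""
    for cutoff, letter in GRADE_CUTOFFS:
        if score >= cutoff:
            return letter
    return "F"  # anything below 60


# Hypothetical scores, for illustration only.
for s in (95, 89.9, 60, 59.9):
    print(s, letter_grade(s))
```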

Academic Integrity

We encourage collaboration with a partner, but sharing code between groups is strictly forbidden - this is a form of plagiarism, as is showing your work to other students, even just for a second. There is rarely one single correct way to write code that solves a problem. While we want you to feel free to discuss your approach with a partner, you should know that there are often many solutions for a given problem, and it's typically obvious when one student shares code with another. If you directly copy and paste code from the Internet (or even the text), cite your source in your comments (but also ensure that you understand what the code is doing - not all code on the web is good!). Assignments will be checked using plagiarism detection software and by hand to ensure the originality of the work.

Do not share your code with anyone other than a partner. Do not let someone look at your screen. You may get behind, or your friend may ask for help, but the consequences for plagiarism are far worse than those of an incomplete submission, which will still likely earn some points. If I suspect that you have purposely shared code with another student or presented someone else's work as your own, the matter will be referred to the Academic Integrity Code Administrator for adjudication. If you are found responsible for an academic integrity violation, sanctions can include a failing grade for the course, suspension for one or more academic terms, dismissal from the university, or other measures as deemed appropriate by the Dean.

All students are expected to adhere to the American University Honor Code. If you have a question about whether or not something is permissible, ask the instructor or the TA first.


Textbook

This course partially adopts the textbook "Learning Spark", 2nd Edition published by O'Reilly Media, Inc.
The online version of the book may be accessible for free from AU’s online Library. After selecting "O’Reilly Online Learning" from the list and logging in with your AU account, you should be able to search for the book by name, or try accessing it from this link.
Additional learning resources in the forms of slides, readings, and coding examples will be provided on Canvas throughout the course.




Acknowledgments

Course design by Roberto Corizzo at American University.

Special thanks to the Databricks University Alliance for their educational and computational resources.

Thanks to Alex Godwin at American University for designing this syllabus template.