Graduate Project
- Introduction
- Modern Book and Research Papers on Machine Learning
- Project
- Project Proposal and Group Formation
- Background
- Step-by-Step Instructions
- 0. Schedule a Meeting with the Course Instructor
- 1. Understand the Project Scope
- 2. Identify Key Areas
- 3. Search for Literature
- 4. Collect Relevant Papers
- 5. Focus on Datasets
- 6. Organize Your Findings
- 7. Review the Literature
- 8. Synthesize the Information
- 9. Write the Literature Review
- 10. Peer Review and Revise
- 11. Final Formatting and Submission
- Implementation Demonstration and Followup Questions
- Final Writeup with Results
Introduction
Students in the graduate version of the course will complete two additional assignments:
- A modern book on machine learning, approved by the instructor (e.g., Genius Makers, The Alignment Problem, Human Compatible)
- An original group project
Modern Book and Research Papers on Machine Learning
The field of machine learning is constantly evolving but has, at this point, a considerable history. There are several wonderful books that offer broad insight drawn from lifetimes of work in the field.
The course staff will be reading “Algorithms to Live By” by Brian Christian and Tom Griffiths. Other options are “Genius Makers”, “The Alignment Problem”, “Human Compatible”, and “Weapons of Math Destruction”. Other books not on this list are allowed, but must be approved by the instructor.
An afternoon discussion of the reading, over donuts and coffee, will be scheduled in the first half of the semester. The discussion will center on machine learning history, modern problems in machine learning, and the future of machine learning.
5220 students will be required to attend. Any 4220 students who choose to make use of the available extra credit by reading one of the books will also be invited!
Project
Students in the graduate course will form into groups, and then propose and conduct an original project that utilizes concepts from both machine learning and data science. Projects related to climate data, natural language processing, or machine vision are encouraged.
There will be four milestones, with deadlines to be released soon: (1) proposal and rubric, (2) background literature, (3) implementation demonstration and follow-up questions, and (4) evaluation.
Project Proposal and Group Formation
This project is intended to be a self-directed, problem-based exploration of one or more questions that are mutually interesting to the group. Groups should have between 3 and 5 students and are not set in stone until the project proposal is submitted.
So, there are two ways to go about forming a group to complete the project proposal.
- Choosing a group based on those with whom you enjoy working.
- Forming a group around a question that is important to you and others.
In research, I follow both paths. There are those that I work with because I like to work with them (this is usually the case for projects with those I already know). Then, there are those that I work with because a problem is mutually very compelling (often this is how I meet new colleagues!). Both paths can be very successful.
Regardless of how the group forms, the team must formally decide what questions they want to try to answer. The constraint is that each question should be addressable with data that the team can plausibly acquire. So, answering the question, “is there an alien outpost on the dark side of the moon?” would be challenging to address without a great source for satellite data or a moon mission.
Referring to the data science life-cycle, the project proposal requires the team to define the questions and to have a plan for how and where they will obtain relevant data. The source of data could be public repositories, hand gathering, or anything in between.
Additionally, the project proposal should have a section in which the team discusses what success involves. For instance, it may not be possible to decide whether homes under power lines pose a risk in general. But would it be a success if the team could show that the results from previous work on leukemia rates were statistically related to the age of the power line? I would think yes. So, the team should identify subgoals that each contribute to success. In this way, even if the question can’t be directly answered, the project is still successful.
The subgoals are not set in stone. As you explore the data, you may ask additional questions (indeed you should). These can be added to the subgoals. However, the big picture, motivational question will be set in stone after the project proposal is approved by the instructor.
The proposal (and all written documentation for the graduate project) should be written in the AAAI format. I recommend using Overleaf to prepare it. There’s a tutorial on Overleaf linked on the resources page.
Background
The goal of the background literature review is to provide a comprehensive understanding of the current state of research related to your project topic, with a particular focus on the datasets and algorithms used by other researchers. This will help define your project scope, identify gaps, and highlight the significance of your research question.
The literature review should include at least 10 highly relevant papers.
Step-by-Step Instructions
0. Schedule a Meeting with the Course Instructor
- Set up a meeting with your course instructor to discuss your project proposal and gather feedback.
- Be prepared to present your preliminary literature review findings.
- Use the feedback to refine both your project scope and literature review.
1. Understand the Project Scope
- Review your project’s proposal and the defined research questions.
- Make sure you clearly understand the data science and machine learning concepts your project involves.
2. Identify Key Areas
- Based on your project’s theme (climate data, natural language processing, machine vision, etc.), determine the primary areas of literature you need to explore, focusing especially on the datasets used in similar research.
3. Search for Literature
- Use academic databases such as Google Scholar, IEEE Xplore, PubMed, and the ACM Digital Library, along with AI-assisted search tools such as Perplexity.ai.
- Include keywords related to your project’s focus area, specific subtopics, and datasets.
- Consider recent publications (last 5 years) to ensure your review is up-to-date.
4. Collect Relevant Papers
- Select papers that are seminal, highly cited, or directly related to your project, as well as papers that use datasets relevant to your research.
- Aim for a mix of survey papers, journal articles, and conference proceedings.
5. Focus on Datasets
- Pay special attention to the datasets used in the studies you collect.
- Note the types of data (e.g., images, text, time-series), sources of the data (public repositories, custom collections), and any data preprocessing steps mentioned.
- Look for any common datasets that are frequently used within your research area.
6. Organize Your Findings
- Use a reference management tool (e.g., Zotero, Mendeley) to keep track of the papers and their citations.
- Categorize the papers based on themes such as methodologies, datasets used, findings, and gaps identified by previous researchers.
- Create a separate section or annotation for details about the datasets.
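For groups that prefer to keep their notes in code alongside a reference manager, the categorization described above could be sketched roughly as follows. This is only an illustration: the `PaperNote` fields, the helper functions, and every paper and dataset name are hypothetical placeholders, not a required format.

```python
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class PaperNote:
    """One reviewed paper (all field names are illustrative assumptions)."""
    key: str                                            # citation key from your reference manager
    year: int
    themes: List[str] = field(default_factory=list)     # e.g. "methodology", "findings"
    datasets: List[str] = field(default_factory=list)   # datasets the paper uses
    gaps: str = ""                                      # gaps the authors identify


def dataset_counts(notes: List[PaperNote]) -> Counter:
    """Count how many papers use each dataset (each paper counted once per dataset)."""
    return Counter(d for note in notes for d in set(note.datasets))


def by_theme(notes: List[PaperNote]) -> Dict[str, List[str]]:
    """Group citation keys under each theme they were tagged with."""
    grouped: Dict[str, List[str]] = {}
    for note in notes:
        for theme in note.themes:
            grouped.setdefault(theme, []).append(note.key)
    return grouped


# Placeholder entries only; replace with the papers you actually collect.
notes = [
    PaperNote("paper_a", 2021, ["methodology"], ["dataset_x", "dataset_y"]),
    PaperNote("paper_b", 2023, ["findings"], ["dataset_x"]),
    PaperNote("paper_c", 2019, ["methodology"], ["dataset_z"], gaps="small sample"),
]

print(dataset_counts(notes).most_common())
print(by_theme(notes))
```

Counting dataset usage this way makes the frequently used datasets in your research area easy to spot when you write the datasets section.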
7. Review the Literature
- Summarize key findings of each paper, with special emphasis on dataset descriptions and uses.
- Compare methodologies, datasets, and results.
- Identify Gaps: Look for areas that have not been explored or where results have been inconclusive, especially concerning datasets.
- Highlight: Emphasize how your project will address these gaps and leverage the mentioned datasets.
8. Synthesize the Information
- Create a narrative that connects the individual pieces of research, focusing on how the datasets have influenced previous findings.
- Discuss how the existing literature helps frame your research question and methodology, particularly with respect to data sources.
9. Write the Literature Review
Follow the AAAI format for writing your literature review. Structure your review as follows:
- Introduction: State the purpose and scope of the literature review.
- Thematic Sections: Organize the main body into logical sections based on themes (e.g., previous work in climate data analysis, NLP techniques for sentiment analysis, advancements in machine vision), with sub-sections on datasets.
- Datasets Section: Include a dedicated section discussing the datasets used in the reviewed papers.
- Comparison and Discussion: Highlight the strengths and weaknesses of the existing research, focusing on dataset availability and quality.
- Conclusion: Summarize the key insights and how they relate to your project.
10. Peer Review and Revise
- Exchange drafts with group members for feedback.
- Make sure the literature review is clear, coherent, and sufficiently covers the relevant research areas.
11. Final Formatting and Submission
- Ensure the document adheres to AAAI formatting guidelines, including citations and references.
- Use Overleaf to prepare and format your document as recommended.
- Submit the literature review by the deadline specified for milestone 2.
Implementation Demonstration and Followup Questions
In the project proposal, a number of questions and models have been proposed. Now, it is time to design and conduct appropriate experiments to answer these questions.
There are two items that need to be completed in parallel.
- Item 1 - the written documentation of the experimental methodology.
- Item 2 - the experimental implementation.
The written methodology doesn’t need to be presentable (yet). Its purpose is to help you remember the decisions made, and the reasoning behind them, in answering each question. One way to accomplish this is through structured usage of a Jupyter notebook that mixes code and text cells. The text cells explain the question and the methodology while the code conducts the experiment.
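As a rough sketch of what such a notebook might look like, the example below uses the `# %%` cell markers recognized by tools such as Jupytext and VS Code, so it also runs as a plain Python script. The question, the synthetic data, and all names (`fit_line`, `mae`, `SEED`) are assumptions made up for illustration; your own cells will follow your project's questions.

```python
# %% [markdown]
# ## Question (hypothetical): does a fitted line beat a mean-only baseline?
# Methodology: synthetic data, fixed seed, mean absolute error (MAE) on held-out points.
# Decision: shuffle once, then use 80 points for fitting and 20 for evaluation.
# Reasoning: a random split avoids evaluating only on the largest x values.

# %%
import random
import statistics

SEED = 0  # fixed so rerunning the cell reproduces the same numbers
rng = random.Random(SEED)

# Synthetic stand-in for project data: y = 2x + Gaussian noise.
points = [(x / 10, 2.0 * (x / 10) + rng.gauss(0.0, 0.5)) for x in range(100)]
rng.shuffle(points)
train, test = points[:80], points[80:]


# %%
def fit_line(data):
    """Ordinary least-squares slope and intercept for (x, y) pairs."""
    mean_x = statistics.fmean(x for x, _ in data)
    mean_y = statistics.fmean(y for _, y in data)
    slope = sum((x - mean_x) * (y - mean_y) for x, y in data) / sum(
        (x - mean_x) ** 2 for x, _ in data
    )
    return slope, mean_y - slope * mean_x


def mae(predict, data):
    """Mean absolute error of a prediction function on (x, y) pairs."""
    return statistics.fmean(abs(predict(x) - y) for x, y in data)


slope, intercept = fit_line(train)
train_mean_y = statistics.fmean(y for _, y in train)

line_mae = mae(lambda x: slope * x + intercept, test)
baseline_mae = mae(lambda x: train_mean_y, test)

# %% [markdown]
# Result: record the two MAE values here, plus any follow-up question they raise.

# %%
print(f"line MAE={line_mae:.3f}  baseline MAE={baseline_mae:.3f}")
```

Keeping the decision and its reasoning in the text cell directly above the code that implements it is what makes the notebook useful later, when the methodology has to be written up.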
As this process unfolds, you will likely find interesting, answerable questions that you didn’t define beforehand. For instance, if a language model were found to successfully control a robot arm to complete a task, a natural next question would be how large the model needs to be. Choose strategically whether to pursue these follow-up questions based on time and the overall impact on the project’s goal.
The methodology and implementation document (ideally a well-organized set of Jupyter notebooks) should be submitted to this assignment dropbox.
If you have questions regarding appropriate statistical methods for testing certain hypotheses or training certain models, please ask!
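As one illustration of the kind of method that might come up, the sketch below runs an exact paired sign-flip permutation test on per-fold scores from two models evaluated on the same folds. It is only one option among many and not a prescribed method for this course; the function name `paired_sign_flip_test` and the scores are hypothetical placeholders, and enumerating every sign pattern is only practical for a small number of folds.

```python
from itertools import product
from statistics import fmean


def paired_sign_flip_test(scores_a, scores_b):
    """Two-sided exact sign-flip test on paired scores.

    Under the null hypothesis that the two models perform the same, each
    per-fold difference is equally likely to be positive or negative, so we
    enumerate every sign assignment and ask how often the mean difference is
    at least as extreme as the one observed.
    """
    if len(scores_a) != len(scores_b):
        raise ValueError("scores must be paired (same number of folds)")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(fmean(diffs))

    at_least_as_extreme = 0
    total = 0
    for signs in product((1, -1), repeat=len(diffs)):
        stat = abs(fmean(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point ties.
        if stat >= observed - 1e-12:
            at_least_as_extreme += 1
        total += 1
    return at_least_as_extreme / total


# Placeholder accuracies for two hypothetical models on the same 5 folds.
model_a = [0.81, 0.83, 0.80, 0.84, 0.82]
model_b = [0.79, 0.80, 0.79, 0.81, 0.80]
p_value = paired_sign_flip_test(model_a, model_b)
print(f"p-value: {p_value:.4f}")
```

With only 5 folds there are 32 sign patterns, so the smallest achievable two-sided p-value is 2/32; that floor is worth keeping in mind when deciding how many folds or repetitions an experiment needs.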
Final Writeup with Results
Finally, after conducting the experiments and obtaining results, create a well-structured document that joins the proposal, the background, the methodology for answering each question, and the results into a single paper that is between 4 and 8 pages in length.
Update each section to reflect the final decisions of the project, including any follow-up questions. The results section should present information in the form most appropriate for the type of result. For instance, do not list numeric values for 5 models in the body of a paragraph; a table would be more appropriate. Do not list values in a table when a graph or figure would be more appropriate.
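To illustrate the table-over-paragraph point, here is a minimal sketch that turns per-model results into an aligned plain-text table, which can help when checking numbers before typesetting them in the paper. The function name `format_results_table`, the metric columns, and every model name and value are made-up placeholders.

```python
def format_results_table(results):
    """Format (model_name, accuracy, f1) tuples as an aligned text table."""
    header = ("Model", "Accuracy", "F1")
    body = [(name, f"{acc:.3f}", f"{f1:.3f}") for name, acc, f1 in results]

    # Each column is as wide as its longest cell, header included.
    widths = [max(len(row[i]) for row in [header] + body) for i in range(3)]

    def render(row):
        return "  ".join(cell.ljust(w) for cell, w in zip(row, widths)).rstrip()

    rule = "  ".join("-" * w for w in widths)
    return "\n".join([render(header), rule] + [render(row) for row in body])


# Placeholder values for five hypothetical models; not real results.
results = [
    ("model_1", 0.812, 0.790),
    ("model_2", 0.845, 0.831),
    ("model_3", 0.798, 0.775),
    ("model_4", 0.860, 0.852),
    ("model_5", 0.823, 0.809),
]
table = format_results_table(results)
print(table)
```

Five models with two metrics each is already ten numbers, which is hard to compare inside a sentence but easy to scan in rows and columns.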
The final two sections in the writeup should be called Conclusions and Limitations, respectively. In the Conclusions section, provide a concise restatement of the questions investigated and answers obtained. In the Limitations section, point out the ways in which the answers to your questions are limited in applicability by the chosen dataset(s), methodology, and/or results.