Project description
MATH 248 Project
Due dates:
Group sign-up is due Wednesday, October 14 by 11:59 pm.
Group Charter: Friday, October 16 by 11:59 pm.
Initial proposal: Friday, October 23 by 11:59 pm.
Meetings: on Fridays between October 30 and November 13 during class time
Project outcomes: video presentation + info sheet are due on Wednesday, November 18 by 11:59 pm.
Letter of Learning: Friday, November 20 by 11:59 pm.
Project discussion: Friday, November 20 by 1:00 pm.
In groups of two, or individually, you will research a statistical modeling technique or topic that has not been presented in class and will not be covered in this class. You will need to apply that method to a particular question and dataset of your choosing. I provide potential topics and data sources below, but you are not restricted to my suggestions.
Working in Groups/Group Sign-up
You are encouraged to work in groups of up to two people. However, if you decide to work in a group you will need to:
1. Declare your group here by Wednesday, October 14.
2. Fill out and sign the Group Charter, due on Friday, October 16. Submit it on Canvas, and make sure that you submit the PDF version of the file.
3. Write a description of what each individual student in the group contributed to the project in your letter of learning, due Friday, November 20 by 11:59 pm.
All members of the group will receive the same grade for the project.
As one of the outcomes of this project, your group will record a short presentation of your modeling technique or topic for the class. The presentation should be approximately 10 minutes long. Presenting time should be distributed evenly among group members; otherwise, the overall group grade will be penalized.
What should the presentation include?
In general, your presentation should include the following sections.
- Introduction. An explanation of the data and questions related to the data. What problems or questions did you set out to investigate? How were the data collected?
- Methodology. Explain the statistical modeling technique. When is it used? What are the model assumptions? Can you put it in context with other techniques we have learned in class? (In other words, why do we need a new technique?)
- Results and conclusions. A summary and presentation of your data analyses. What did you find out? This might include tables, graphs, or verbal summaries.
- Discussion and critique. What did you learn about the problem or question you set out to investigate? What were the weaknesses and strengths of your analysis and this method?
- Info sheet. You must also produce a one- to two-page information sheet (R Markdown file) about how to fit your modeling technique in R. Think of the information sheet as a “help page” that is much more helpful than the usual R help pages. The information sheet should include, but is not limited to:
  - A brief description of the modeling technique
  - Other names of the technique
  - Useful R code applied to a dataset and an explanation of the code
Timeline
You need to submit your initial proposal to me by Friday, October 23 at 11:59 pm. You need to include the group members, project topic, and area of application. No two groups can work on the same topic, and preference will be given to the group that submits its topic first. Several groups can look at the same application area, but no two groups can use the same dataset.
Each group will have a meeting with me to discuss the project. The meeting will be held between October 30 and November 13, and a sign-up sheet for the meeting is available here. The group should come to this meeting prepared to discuss the work they have done, any questions they have, and how they plan to structure their presentation. For the meeting, the group must bring an outline of the presentation. Drafts of the presentation slides and topic info sheet are advisable but not required.
Potential Topics
Below is a list of potential topics, each with at least one reference to get you started. You do not have to pick a topic from this list. If you pick a different topic, don’t start working on your project until I have responded to your initial proposal, because I may ask you to refine or modify your chosen topic.
- Poisson regression: A generalized linear model where the response variable represents a count.
  - References: Chapter 8 of Practicing Statistics by Kuiper and Sklar; Chapter 4 of Analysis of Categorical Data with R by Bilder and Loughin
- Tree-based methods: Regression and classification trees. The general idea is to segment the predictor space into a number of simple regions.
  - References: Ch 11.4 of Applied Linear Statistical Models by Kutner, Nachtsheim, Neter, and Li; Chapter 8 of An Introduction to Statistical Learning by James et al.
- Support vector machines: Support vector machines (SVMs) are supervised learning models with associated learning algorithms that analyze data for classification and regression analysis.
  - Reference: Chapter 9 of An Introduction to Statistical Learning by James et al.
- Weighted Least Squares: A method to deal with the violation of constant variance. An increase in variance as the predictor increases is very common, especially in economics applications.
  - Reference: Chapter 11.1 of Applied Linear Statistical Models by Kutner, Nachtsheim, Neter, and Li
- Principal components regression: In PCR, instead of regressing the dependent variable on the explanatory variables directly, the principal components of the explanatory variables are used as regressors.
  - Reference: Chapter 6 of An Introduction to Statistical Learning by James et al.
- Discriminant analysis.
  - Reference: Chapter 4 of An Introduction to Statistical Learning by James et al.
- Clustering: K-means or hierarchical. An unsupervised learning technique that groups a set of objects in such a way that objects in the same group (called a cluster) are more similar (in some sense) to each other than to those in other groups (clusters).
  - Reference: Modern Data Science with R by Baumer, Kaplan, and Horton
- Robust Regression: A regression procedure that tries to control the influence of outliers. You may also want to discuss methods for detecting outliers in regression models.
  - Reference: Ch 11.2 of Applied Linear Statistical Models by Kutner, Nachtsheim, Neter, and Li
- Ridge Regression: A regression procedure that tries to mitigate the impact of multicollinearity.
  - Reference: Ch 11.3 of Applied Linear Statistical Models by Kutner, Nachtsheim, Neter, and Li
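To give a sense of what the methodology part of a presentation might contain, here is a rough sketch of the core model for three of the topics above. These are the standard textbook forms written in notation I am assuming for illustration (not taken from the listed references): $y_i$ is the response, $x_i$ the vector of predictors for observation $i$, $X$ the design matrix, $w_i$ known positive weights, and $\lambda \ge 0$ a tuning parameter. Check the exact formulation in your reference before relying on it.

```latex
% Poisson regression: count response modeled through a log link
\[
Y_i \sim \mathrm{Poisson}(\mu_i), \qquad
\log \mu_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_p x_{ip}
\]

% Weighted least squares: x_i includes a leading 1 for the intercept,
% W = diag(w_1, ..., w_n); a common choice is w_i = 1 / sigma_i^2
\[
\hat{\beta}_{\mathrm{WLS}}
  = \arg\min_{\beta} \sum_{i=1}^{n} w_i \,\bigl(y_i - x_i^{\top}\beta\bigr)^2
  = (X^{\top} W X)^{-1} X^{\top} W y
\]

% Ridge regression: penalized least squares, intercept beta_0 left unpenalized
\[
(\hat{\beta}_0, \hat{\beta})_{\mathrm{ridge}}
  = \arg\min_{\beta_0,\, \beta}
    \left\{ \sum_{i=1}^{n} \Bigl(y_i - \beta_0 - \sum_{j=1}^{p} x_{ij}\beta_j\Bigr)^2
            + \lambda \sum_{j=1}^{p} \beta_j^2 \right\}
\]
```

With all $w_i$ equal, weighted least squares reduces to ordinary least squares, and with $\lambda = 0$ so does ridge regression. Comparisons of this kind are one way to put a new technique in context with the methods from class.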
Not intrigued by any of my suggestions? There are several websites, such as fivethirtyeight.com and simplystatistics.com, which discuss ways of solving interesting data analysis questions.
Initial Proposal should include:
- Names of team members
- Modeling topic
- Research question
- Data source
Potential Data Sources
The project is worth 35% of your final grade. Below is a breakdown of the grade:
- Initial Proposal (5%)
- Meeting (5%)
- Presentation (50%)
- Topic Info Sheet (20%)
- Involvement in your group (10%), based on the letter of learning
- Engagement with presentations (10%) (i.e., asking good questions on the discussion board)
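As a worked example of how these weights combine, assuming the percentages above are shares of the project grade, each component's contribution to the final course grade is its share times 35%:

```latex
\[
\begin{aligned}
\text{Presentation:} \quad & 0.50 \times 35\% = 17.5\% \\
\text{Topic info sheet:} \quad & 0.20 \times 35\% = 7\% \\
\text{Involvement, Engagement:} \quad & 0.10 \times 35\% = 3.5\% \text{ each} \\
\text{Initial proposal, Meeting:} \quad & 0.05 \times 35\% = 1.75\% \text{ each} \\
\text{Total:} \quad & 17.5\% + 7\% + 2(3.5\%) + 2(1.75\%) = 35\%
\end{aligned}
\]
```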
Letter of learning:
This letter should summarize your experience during the project. You need to submit your individual letter of learning on Canvas by Friday, November 20 at 11:59 pm. You will NOT receive a grade for the project until I receive your letter of learning. I reserve the right to use the information in your letter of learning when determining your final project grade.
The letter of learning should include, but is not limited to, the following:
- What was the most interesting/meaningful part of the project?
- What was the most challenging part of the project?
- What strategies have you employed in this project? What worked? What didn't?
- Include the description of the involvement/contribution of each of the team members (including yourself) in the project.