Curriculum

Unit 1: Ethical issues in biomedical data science (3 hours)

1.1 Ethical issues in biomedical data science: Part 1

This section introduces challenges to traditional research ethics posed by new data sources and analytic methods. Topics include the conceptual boundaries of ethical research in the context of biomedical data science and how traditional frameworks adapt or fail in addressing these challenges. The goal is to expand the learner’s awareness of ethical dilemmas and equip them to engage critically with research scenarios that test normative boundaries.

Learning Objectives:

  1. Describe contemporary challenges to traditional ethical practices for responsible research (e.g., reproducibility, systemic racism, ownership).
  2. Discuss 2-3 strategies for addressing these challenges, such as self-reflection, community engagement, and transparency practices.

Assessment Instrument:

Students discuss a case study considering the use of genetic data in the Jackson Heart Study (JHS) and write a short synopsis and reflection on the discussion.

1.2 Ethical issues in biomedical data science: Part 2

This section builds on the foundation of ethical theory to introduce specific theoretical frameworks that are commonly applied in the assessment of biomedical research ethics. It discusses how these frameworks are employed in practice and how they can guide decision-making in ambiguous situations.

Learning Objectives:

  1. Define 2-3 prevalent theoretical frameworks in bioethics guiding decisions of “right” and “wrong” (e.g., principlism, utilitarianism, etc.).
  2. Describe and analyze key ethical considerations for longitudinal studies, data science, and AI such as privacy, consent, and fairness.

Assessment Instrument:

Students write a short memo in a simulation exercise that demonstrates application of ethical reasoning.

Unit 2: Foundations of Data in Biomedical Research (All about Data: Data Management, Representation, Metadata, and Data Sharing; 7 hours)

2.1 Introduction to the Jackson Heart Study

This topic presents a detailed overview of the Jackson Heart Study, focusing on its design, objectives, and contributions to public health. It contextualizes the importance of large-scale longitudinal studies in understanding cardiovascular disease within specific populations.

Learning Objectives:

  1. Describe the JHS Study Design – Explain the purpose, population, and structure of the JHS, including its major exams.
  2. Summarize Data Collection Methods – Identify the types of data collected (e.g., clinical, survey, genetic) and the methods used in different study phases.
  3. Interpret Key Variables and Codebooks – Understand how to use JHS codebooks to find variables relevant to specific research questions.

Assessment Instrument:

  1. Describe in your own words the purpose of the JHS, cohort characteristics, and exam waves.
  2. Describe how JHS collects specific data (e.g., CAC scores, lipid tests) and potential biases in data collection.
  3. Try to answer the following research question by locating relevant variables in the JHS codebook: “Is there an association between hysterectomy and cardiovascular disease in the JHS, adjusting for covariates?” Describe the variables chosen and their rationale.

2.2 The Process of Manuscript Development in the Jackson Heart Study

This section explains the formal process of manuscript development within the Jackson Heart Study framework. It covers procedural steps from proposal to publication, highlighting the importance of transparency and collaboration.

Learning Objectives:

Explain Data Access Procedures – Describe the process for requesting and obtaining JHS data, including data use agreements and ethical considerations.

Assessment Instrument:

  1. In your own words, enumerate the steps involved in the process for developing a JHS manuscript.
  2. Using the information acquired from the lecture, draft a mock JHS manuscript proposal, using the Manuscript Proposal Form provided and the sample manuscript proposal.

2.3 Data About Data: Maximizing data and code for broad uses

This section examines how metadata and standardized practices enhance the reproducibility of scientific research. It introduces the basic categories of metadata and the role of data documentation in transparent research practices.

Learning Objectives:

  1. Understand the standard components of metadata for biomedical research datasets: how the data were collected, from what population, under what circumstances, etc.
  2. Learn to distinguish between good and bad metadata for reproducibility.

Assessment Instrument:

List 5 critical metadata categories that researchers need to know when reproducing findings. From the Diabetes and SES article by Jie Hu and the PDSA codebook documents, what metadata are provided about how the socioeconomic data were collected?
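
The metadata categories discussed in this section can be made concrete with a small, hypothetical codebook entry. The field names below are illustrative only, not the actual JHS or PDSA schema:

```python
# A hypothetical metadata record for one variable in a study codebook.
# Field names are illustrative only, not the actual JHS or PDSA schema.
variable_metadata = {
    "variable_name": "income_category",
    "description": "Self-reported annual household income, categorized",
    "collection_method": "Interviewer-administered questionnaire",
    "population": "Adult cohort participants at baseline exam",
    "time_period": "Baseline examination wave",
    "units_or_levels": ["<25k", "25k-50k", "50k-75k", ">75k"],
    "missing_codes": {"-9": "Refused", "-8": "Don't know"},
}

def metadata_is_complete(record, required_fields):
    """Return the list of required fields missing from a metadata record."""
    return [f for f in required_fields if f not in record or not record[f]]

REQUIRED = ["variable_name", "description", "collection_method",
            "population", "missing_codes"]
print(metadata_is_complete(variable_metadata, REQUIRED))  # [] -> complete
```

A simple completeness check like this is one way to operationalize the difference between "good" and "bad" metadata for reproducibility.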

2.4 Data representation

This section explains how data representation influences the inferences that can be drawn and the interpretations users form. Choices made during quantization, tagging, or restructuring can introduce biases and affect the reproducibility of analyses. Researchers must consider the implications of these decisions from the early stages of the data pipeline through to result presentation.

Learning Objectives:

  1. Understand that the same data can be represented in many ways.
  2. Appreciate that each representation choice makes some tasks easier, but others more difficult.
  3. Learn how to choose a good representation for the task at hand.

Assessment Instrument:

For the scenario specified in the PDF, choose between specified representation choices. Explain the choice by stating benefits and drawbacks for each option considered. Estimated time: 10 min.
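
One concrete instance of the "same data, many representations" idea is the wide versus long layout of repeated measurements. A minimal sketch (the column names and values are fabricated for illustration):

```python
# Same measurements, two representations: wide (one row per subject)
# and long (one row per subject-visit). Column names are hypothetical.
wide = [
    {"id": "A", "bp_v1": 120, "bp_v2": 125},
    {"id": "B", "bp_v1": 140, "bp_v2": 138},
]

def wide_to_long(rows, visits):
    """Reshape wide records into long (subject, visit, value) records."""
    long_rows = []
    for row in rows:
        for visit in visits:
            long_rows.append({"id": row["id"], "visit": visit,
                              "bp": row[f"bp_{visit}"]})
    return long_rows

long = wide_to_long(wide, ["v1", "v2"])
# The long layout makes per-visit aggregation easy...
mean_v1 = sum(r["bp"] for r in long if r["visit"] == "v1") / 2
# ...while the wide layout makes within-subject change easy.
change_A = wide[0]["bp_v2"] - wide[0]["bp_v1"]
print(mean_v1, change_A)
```

Neither layout is "correct": each makes some tasks easier and others harder, which is exactly the trade-off the learning objectives describe.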

2.5 Data Sharing 101

This topic discusses the fundamental goals of open science and the practical constraints that limit data sharing in biomedical contexts. It presents principles for sharing while preserving privacy and complying with legal and ethical guidelines. Attention is also given to infrastructures and policies that shape how data can be accessed and reused.

Learning Objectives:

Appreciate the foundations of Open Science:

  1. NIH Data Management & Sharing Policy – Rationale and Key components
  2. FAIR Guiding Principles – Definition and examples

Assessment Instrument:

Define the rationale behind the NIH Data Management & Sharing requirement; list 5 key components you should include in your 2-page Data Management Plan; list the 4 FAIR principles.

2.6 Data Sharing – The Reality

Expanding on the theoretical principles, this section presents case studies that reveal practical challenges of sharing biomedical data. It explores frameworks like the Five Safes and highlights real examples of trade-offs between openness and confidentiality.

Learning Objectives:

Learn about privacy and confidentiality concerns related to Open Science and data sharing. Articulate the differences among types of biomedical research (bench science, human clinical trials, animal models) and the implications of data sharing for each.

Assessment Instrument:

For the 3 examples of biomedical research provided, learners will articulate the promises & pitfalls of data sharing for reproducibility.

Unit 3: Rigorous statistical design (5.5 hours)

3.1 Developing a study design for a research aim

This session emphasizes the initial planning phase of a research study. It covers defining the hypothesis, specifying variables, and planning sampling strategies. The aim is to ensure students are capable of structuring research that allows credible statistical inference.

Learning Objectives:

Using a research aim provided by a health researcher, the student should be able to discuss one or more study designs that rigorously address the stated research aim. The student should be able to discuss the strengths and weaknesses of alternative study designs in terms of rigor, as well as in terms of practical aspects of implementation and execution.

Assessment Instrument:

Glomerular filtration rate (GFR) is a measure of kidney performance: it is the rate (in mL per minute) at which the kidneys filter the blood. Higher values of GFR indicate better kidney performance. GFR declines with age, but individuals differ in the rate and shape of their GFR trajectories. Low GFR is a risk factor for both heart failure and kidney failure. Given this context, propose a study design that could characterize the diversity of GFR trajectories in a population of interest and also quantify the extent to which low GFR is associated with heart failure.

3.2 Developing an analytic plan for a study design

Once a study is designed, the analytic strategy must align. This includes specifying hypotheses, choosing statistical tests, and considering data limitations. Emphasis is on pre-specification to reduce analysis bias.

Learning Objectives:

For a given research aim and study design, the student should be able to develop an analytic plan, employing statistical methods appropriate to the structure of the study design, that could be used to obtain scientifically rigorous insight about the aim of interest. The analytic plan should be accompanied by a power analysis that allows the researcher to anticipate what they will be able to discover, and what they are likely to miss, when carrying out the research. The power assessment should include discussion of the precision with which parameters of interest will be estimated and the effect sizes that can confidently be detected. The student should be able to justify that their assessment of statistical power reflects the real-world reproducibility of the analytic findings.

Assessment Instrument:

Based on the study design that you proposed in part 3.1, develop an analytic plan that can be used to provide quantitative evidence regarding the stated research aims. Then conduct a power analysis that supports your study design decisions.
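
As a starting point for the power analysis requested above, a back-of-the-envelope calculation for a two-sample comparison of means can be sketched with the normal approximation. This is only an approximation (an exact t-based calculation or design-specific simulation would be used in practice), and the effect sizes are illustrative:

```python
from statistics import NormalDist
from math import ceil

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate sample size per group for a two-sample comparison of
    means (Cohen's d), using the normal approximation to the two-sided test."""
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)   # critical value, two-sided test
    z_beta = z.inv_cdf(power)            # quantile for the desired power
    return ceil(2 * ((z_alpha + z_beta) / effect_size) ** 2)

# Detecting a medium effect (d = 0.5) with 80% power at alpha = 0.05:
print(n_per_group(0.5))  # about 63 per group (the exact t-test gives ~64)
```

The same structure (critical value, power quantile, effect size) underlies power calculations for most simple designs; more complex longitudinal designs typically require simulation.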

3.3 Interpreting findings from empirical research

Students learn how to critically assess empirical results, focusing on bias, confounding, effect modification, and interpretation limits.

Learning Objectives:

Given a statement of research aims, a cohort study, and an analysis plan for an observational study, students should be able to identify specific risks of bias, uncertainty, and non-reproducibility due to the observational nature of the study. In addition, students should be able to propose some elementary remedies for these challenges.

Assessment Instrument:

Produce realistic mock results from the study design and analytic plans developed in parts 3.1 and 3.2. For these mock results, interpret the findings in relation to the provided research aim. Be sure to consider your confidence in the results, and how aspects of the study design may influence bias and precision that in turn impact your conclusions.

Unit 4: Predictive Modeling (6 hours)

This section focuses on aspects of data modeling, feature engineering, clustering, and classification as components of predictive modeling. Alongside a case study, we review rubrics for standardized reporting of predictive models.

4.0 Pre-reading materials

This session focuses on ideas from basic statistics and exploratory data analysis. It examines distributional checks, visualizations, and data pre-processing.

Learning Objectives:

Students will be able to:

  1. Understand ideas from basic statistics, such as probability distributions and hypothesis testing
  2. Understand basic exploratory data analysis techniques, including visualizations, hypothesis tests, and the handling of missing data
  3. Understand TRIPOD guidelines pertinent to this session

Assessment Instrument:

N/A. Please read the paper: https://www.sciencedirect.com/org/science/article/pii/S2291969425000225

4.1 Data preparation and fundamentals of data clustering

This section focuses on dimension reduction, feature engineering, and how these choices impact modeling (like clustering and classification).

Learning Objectives:

Students will be able to:

  1. Understand TRIPOD guidelines for this subsection
  2. Understand dimension reduction approaches (like PCA, MDS)
  3. Understand their consequences for pre-processing techniques such as feature selection, and for downstream clustering/classification

Assessment Instrument: 15 mins

Describe the variables chosen and their rationale. Please populate the items for this subsection of the TRIPOD checklist.

4.2 Modeling tools (Classification)

Students are introduced to modeling techniques and frameworks for developing and evaluating classification models. Topics include classification, model fit and accuracy metrics, and context-appropriate model selection.

Learning Objectives:

Students will be able to:

  1. Understand principles of classification modeling and metrics for their principled evaluation
  2. Understand pitfalls and potential failure modes in predictive modeling, including data leakage
  3. Understand TRIPOD-AI guidelines (extension from statistical models to AI models)

Assessment Instrument: 30 mins

Please populate the items for this subsection of the TRIPOD checklist.
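
The evaluation metrics named in this subsection can be illustrated with a stdlib-only sketch that computes a confusion matrix, sensitivity, and specificity. The labels below are fabricated purely for illustration:

```python
def confusion_counts(y_true, y_pred):
    """Return (TP, FP, TN, FN) counts for binary labels (1 = positive)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp, fp, tn, fn

# Fabricated true labels and model predictions, for illustration only.
y_true = [1, 1, 1, 0, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

tp, fp, tn, fn = confusion_counts(y_true, y_pred)
sensitivity = tp / (tp + fn)   # true positive rate (recall)
specificity = tn / (tn + fp)   # true negative rate
accuracy = (tp + tn) / len(y_true)
print(sensitivity, specificity, accuracy)
```

Reporting the full confusion matrix, rather than accuracy alone, is one of the habits the TRIPOD guidance encourages, since accuracy can mislead on imbalanced outcomes.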

4.3 A review of predictive modeling, in context of a Case Study

The session connects prior topics in a coherent pipeline, including a case study integrating all these ingredients (feature selection, clustering, classification and experimental validation). It emphasizes reproducibility and the accurate communication of modeling choices.

Learning Objectives:

Students will be able to:

  1. Understand the overall workflow, through a demonstration case study from high-throughput screening
  2. Understand overall TRIPOD guidelines and ingredients

Assessment Instrument: 1 hour

Complete the TRIPOD-AI checklist for the paper: https://www.sciencedirect.com/org/science/article/pii/S2291969425000225. Discuss what was missing from the reporting within the paper. Answer questions aimed at building intuition for these techniques.

Unit 5: Reproducible Workflows (5.5 hours)

5.1 Goals of Reproducible Analyses

This section introduces the principles underlying reproducible research. These include transparency, user accessibility, reusability, and long-term preservation. Emphasis is placed on organizing work for future verification and reuse by other researchers.

Learning Objectives:

Awareness of key challenges and goals when creating reproducible workflows, including making analyses reproducible, user-friendly, transparent, reusable, version-controlled, and archived.

Assessment Instrument:

Write in your own words a summary of the key challenges and goals when creating reproducible workflows.

5.2 Reproducibility via Code Notebooks

This session presents tools like Jupyter and Quarto notebooks, which support narrative and code integration. Learners are introduced to markdown and interactive documentation.

Learning Objectives:

Awareness of Markdown, Jupyter, and Quarto, and how these tools can be integrated into reproducible workflows.

Assessment Instrument:

Create a script to download the simulated Jackson Heart Study data from GitHub; the script should be in a markdown file. Then create a notebook that performs some simple EDA on the data, generating at least one plot and at least one table. Upload your analysis file.

5.3 Best practices for Reproducible Programming

Students learn software engineering techniques that support reproducibility. These include the DRY principle, modular code, and use of configuration files.

Learning Objectives:

Awareness of best practices for reproducible programming, including writing scripts and functions, avoiding magic numbers, caching, and seeding randomness, and how to refactor code to align with these practices.

Assessment Instrument:

Add documentation to a simple EDA notebook. Create a makefile similar to the one shown in class. E.g., with commands to download the data, run the analysis, etc. Upload your analysis file.
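
Two of the practices listed above, named constants instead of magic numbers and explicitly seeded randomness, can be sketched in a few lines. The seed and constants here are arbitrary illustrations:

```python
import random

# Named constants instead of magic numbers scattered through the code.
SEED = 20240101          # arbitrary illustrative seed
N_SAMPLES = 5
LOW, HIGH = 0, 100

def simulate_measurements(n, rng):
    """Draw n fake measurements from an explicitly seeded generator."""
    return [rng.randint(LOW, HIGH) for _ in range(n)]

# Seeding makes the "random" output identical on every run.
rng = random.Random(SEED)
first_run = simulate_measurements(N_SAMPLES, rng)

rng = random.Random(SEED)
second_run = simulate_measurements(N_SAMPLES, rng)

print(first_run == second_run)  # True: the runs are reproducible
```

Passing the generator in as an argument, rather than seeding a global, also keeps the function testable and self-documenting.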

5.4 Version Control

This section explains version control systems like Git. Students learn how to track changes, manage branches, and collaborate efficiently.

Learning Objectives:

Familiarity with git and its benefits, and the ability to begin using it for simple tasks, including cloning, committing changes, pushing and pulling.

Assessment Instrument:

Put the source code for a template project on GitHub. Submit the link to your GitHub page. Describe any challenges you encounter.

5.5 Containers

Students are introduced to containerization tools that allow environment replication across machines. Tools include Docker and Singularity.

Learning Objectives:

Familiarity with various tools for dependency management, including Python virtual environments, renv, and containerization, and their respective strengths and weaknesses. Ability to create and run simple Docker images.

Assessment Instrument:

Put a template project into a Docker image, and ensure that it runs as expected. Upload the Dockerfile and other source files to the GitHub project. Describe any challenges you encounter.

5.6 Assembling a full analysis pipeline

This session brings together all prior elements into a complete reproducible analysis pipeline. Topics include automation, archiving, and integration of documentation.

Learning Objectives:

Considerations when organizing an analysis pipeline, and the ability to assemble a full template pipeline.

Assessment Instrument:

Create a template project. It should have a directory structure similar to the one shown in class (e.g., with a “data” directory, “analysis” directory, etc.) and include a few simple markdown files. Push the full analysis to GitHub. Describe any challenges you encounter.

Unit 6: Meta-analysis (3.5 hours)

6.1 Key concepts in research synthesis and integration of statistical evidence

This section presents the principles of research synthesis. It explains how systematic reviews and meta-analyses combine evidence across studies, and how variation between studies can be understood.

Learning Objectives:

The students will understand the purpose of meta-analysis, and what information is needed to conduct a meta-analysis. They will understand the differing roles of meta-analyses based on complete data versus summary data. In terms of foundational ideas, they will understand:

  1. The difference between statistical uncertainty and variation
  2. The basic mathematics behind pooling p-values, standard errors, and confidence intervals from independent sources
  3. How to weight estimates with different precisions to produce an optimal pooled estimate

Assessment Instrument:

The assessment is contained in the Jupyter notebook at https://github.com/kshedden/workshops/blob/main/dair3/meta/exercises/exercises%20without%20solutions.ipynb
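
The precision-weighting idea in objective 3 can be written out directly: each study's estimate is weighted by the inverse of its squared standard error. This is a fixed-effect sketch, and the study estimates below are fabricated:

```python
from math import sqrt

def fixed_effect_pool(estimates, std_errors):
    """Inverse-variance weighted pooled estimate and its standard error."""
    weights = [1 / se ** 2 for se in std_errors]           # precision weights
    pooled = sum(w * est for w, est in zip(weights, estimates)) / sum(weights)
    pooled_se = sqrt(1 / sum(weights))                     # SE of the pooled estimate
    return pooled, pooled_se

# Three fabricated studies: effect estimates and their standard errors.
estimates = [0.30, 0.10, 0.25]
std_errors = [0.10, 0.20, 0.15]

pooled, pooled_se = fixed_effect_pool(estimates, std_errors)
print(round(pooled, 3), round(pooled_se, 3))
```

Note that the pooled standard error is smaller than any single study's, which is the statistical payoff of synthesis; the precise estimator and its comparison with t-based or random-effects alternatives are developed in the linked exercises.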

6.2 Adjusting for heterogeneity

The session introduces statistical methods for heterogeneity. Learners will see how to interpret between-study variation and use modeling strategies to adjust for known differences.

Learning Objectives:

Students will understand how variation in measured effect sizes can be partitioned into the component attributable to statistical variation, and the component attributable to heterogeneity in the true effects. They will also understand how stratification and regression can be used to estimate a consensus effect size and significance level from studies with heterogeneous designs.
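
One common way to carry out the partition described above is Cochran's Q together with the DerSimonian-Laird estimate of the between-study variance tau-squared. This is one estimator among several, and the effect sizes below are fabricated:

```python
def dersimonian_laird_tau2(estimates, std_errors):
    """Between-study variance (tau^2) via the DerSimonian-Laird method."""
    k = len(estimates)
    w = [1 / se ** 2 for se in std_errors]
    pooled = sum(wi * e for wi, e in zip(w, estimates)) / sum(w)
    # Cochran's Q: weighted squared deviations from the pooled estimate.
    q = sum(wi * (e - pooled) ** 2 for wi, e in zip(w, estimates))
    # Excess of Q over its expectation (k - 1) is attributed to heterogeneity.
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    return max(0.0, (q - (k - 1)) / c)

# Fabricated effect sizes; spread that is wide relative to the standard
# errors suggests heterogeneity beyond sampling variation (tau^2 > 0).
tau2 = dersimonian_laird_tau2([0.6, 0.1, -0.2], [0.1, 0.1, 0.1])
print(tau2 > 0)  # True for these numbers
```

When tau-squared is near zero, observed variation is consistent with sampling error alone; a substantially positive value motivates the stratification and regression adjustments discussed in this session.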

6.3 Accounting for non-independence and network effects

This topic explores how meta-analyses can incorporate dependencies between studies and subjects. Emphasis is on modeling frameworks like network meta-analysis.

Learning Objectives:

The students will understand how non-independence of research results impacts research synthesis, and will be able to identify possible sources of non-independence. They will be able to employ multilevel regression to account for non-independence when conducting a meta-analysis.

Unit 7: Transformer-based AI in Biomedical Research (3 hours)

7.1 The basics of LLMs

This session presents the core mathematical ideas behind large language models (LLMs) such as transformers. Topics include attention mechanisms and comparison to earlier architectures.

Learning Objectives:

Participants will gain a broad understanding of the theoretical aspects of transformer models and will be able to select a tool appropriate for a given task. This session will present examples of simple artificial neural networks (ANNs) and simple transformer models, and explain how transformers become large language models (LLMs).

Assessment Instrument:

In your own words, explain: What are word embeddings? Comment on the difference the number of parameters in an LLM makes. Which LLM would you choose for the development of an analytical pipeline among several options, including Llama 7B, Llama 30B, GPT-4o, Claude 3.5, and foundational models? Refer to the Reference Guide in the lesson’s PDF section. Estimated time: 15 min.
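
The notion of a word embedding can be made concrete with toy vectors and cosine similarity. The 3-dimensional vectors below are fabricated; real embeddings have hundreds or thousands of dimensions:

```python
from math import sqrt

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v))
    return dot / norm

# Fabricated 3-d "embeddings"; real models use far more dimensions.
embeddings = {
    "heart":   [0.90, 0.80, 0.10],
    "cardiac": [0.85, 0.75, 0.20],
    "banana":  [0.10, 0.00, 0.95],
}

sim_related = cosine_similarity(embeddings["heart"], embeddings["cardiac"])
sim_unrelated = cosine_similarity(embeddings["heart"], embeddings["banana"])
print(sim_related > sim_unrelated)  # semantically close words score higher
```

This geometric view, in which meaning is encoded as direction in a high-dimensional space, is the starting point for understanding the attention mechanisms covered in this session.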

7.2 Consensus in LLMs

This session focuses on the generation of stable outputs from stochastic models. It introduces strategies to build ensemble outputs and improve reproducibility of LLM responses.

Learning Objectives:

Participants will gain an understanding of how consensus works across multiple LLM systems, e.g., GPT and Anthropic’s Claude. The emphasis will be on application programming interfaces (APIs) rather than web-based apps.

Assessment Instrument:

Pre-workshop tasks included setting up the environment for local access to LLMs via APIs. Execute a consensus framework either manually across browsers or programmatically with the source code provided. Ask multiple LLMs about routes of analysis for JHS data. Refer to the step-by-step tutorial in the lesson’s PDF section. Estimated time: 30 min.
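
The consensus step itself, independent of any particular provider's API, can be sketched as a majority vote over responses gathered from several models. The responses below are canned placeholders standing in for real API calls:

```python
from collections import Counter

def consensus(responses):
    """Return the majority answer and its vote share across models."""
    counts = Counter(responses.values())
    answer, votes = counts.most_common(1)[0]
    return answer, votes / len(responses)

# Canned placeholder outputs standing in for real API calls; model
# names and answers are hypothetical.
responses = {
    "model_a": "logistic regression",
    "model_b": "logistic regression",
    "model_c": "random forest",
}

answer, share = consensus(responses)
print(answer, share)  # majority answer and its vote share
```

In practice each value would come from an API call, and the vote share gives a crude confidence measure; low agreement is a signal to rephrase the prompt or add models rather than trust any single output.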

7.3 LLMs in Biomedical Research

This session connects LLMs to domain-specific applications. It includes discussion of data annotation, prompt engineering, and evaluating outputs.

Learning Objectives:

Participants will create a consensus analysis pipeline using LLMs to complete a technical task. A step-by-step template is provided. Participants will be able to expand this template to complete their assignment.

Assessment Instrument:

Create an LLM consensus analysis pipeline to:

  1. Generate source code based on data files, for data ingestion, sanitization, quantitative analysis, and presentation of results using ABBA in Form 1 from the JHS, or
  2. Prepare a proposal for the JHS using guidelines and a research idea via LLMs; participants will not write text directly.

The outcome is either a technical report produced by executing the analysis pipeline, or the draft of a proposal for the JHS. Refer to the step-by-step tutorial in the lesson’s PDF section. Estimated time: 1 hour.