Curriculum - DAIR-3

Unit 1: Introduction: Responsible conduct of research (RCR), rigor and reproducibility in the complex world of biomedical data science (3 hours)

1.1 RCR in the context of biomedical data science – Ethical issues in biomedical data science: Part 1

This lesson examines the sociotechnical and ethical aspects of biomedical data science. Students will analyze sociotechnical systems, differentiate bioethical and other ethical approaches, and evaluate key ethical challenges in data science research and applications.

Learning Objectives:

Analyze the sociotechnical system of biomedical data science
Differentiate between traditional bioethical, sociotechnical, and other ethical approaches to data science research and applications
Evaluate key ethical challenges in biomedical data science

Assessment Instrument:

Select one of the following AI health applications:

A smartphone app that uses AI to analyze speech patterns to detect early signs of depression
An algorithm that predicts hospital readmission risk and prioritizes follow-up care
A wearable device that monitors vital signs and uses AI to detect potential health emergencies

Then answer the following questions:

What ethical concerns arise from the use of the technology?
List 2 domains beyond the technology itself that should be examined (use the sociotechnical framework presented). For each domain, describe specific ethical concerns that might arise.
Evaluate ethical issues related to data collection, representation, storage, and use. Discuss potential biases in the data and implications for equity and justice.

1.2 What are ethics? – Ethical issues in biomedical data science: Part 2

This lesson equips students to address ethical challenges in biomedical data science. Learners will identify strategies for ethical secondary data use, analyze stakeholder engagement approaches, and develop frameworks for ethical project review, emphasizing anticipatory governance and responsible data science practices.

Learning Objectives:

Identify and formulate approaches to address ethical issues in secondary use, including anticipatory governance principles
Analyze stakeholder engagement strategies for data science
Develop a framework for ethical review of biomedical data science projects

Assessment Instrument:

The assessment involves designing a governance framework for a data science initiative, selecting one of three projects: a hospital readmission algorithm, a health outcomes database for underserved communities, or a mental health chatbot for students. The framework must address stakeholder engagement, decision-making, monitoring, benefit-sharing, unexpected impacts, and consent, while evaluating ethical considerations, future challenges, and feasibility. The goal is to create an ethical, well-structured, and adaptable governance plan.

Unit 2: Foundations of Data in Biomedical Research (All about Data: Data Management, Representation, Metadata, and Data Sharing) (7 hours)

2.1 Data Management – Introduction to the Jackson Heart Study

This lesson provides an overview of the Jackson Heart Study (JHS), focusing on its design, data collection, and variable interpretation. Students will describe the JHS’s purpose, population, and exam structure, summarize clinical, survey, and genetic data collection methods across study phases, and learn to use JHS codebooks to identify variables for research questions.

Learning Objectives:

Describe the JHS Study Design – Explain the purpose, population, and structure of the JHS, including its major exams.
Summarize Data Collection Methods – Identify the types of data collected (e.g., clinical, survey, genetic) and the methods used in different study phases.
Interpret Key Variables and Codebooks – Understand how to use JHS codebooks to find variables relevant to specific research questions.

Assessment Instrument:

Describe in your own words the purpose of the JHS, cohort characteristics, and exam waves.
Describe how JHS collects specific data (e.g. CAC scores, lipid tests) and potential biases in data collection.
Try to answer the following question, “Association between hysterectomy and cardiovascular disease in the JHS, adjusting for covariates by locating relevant variables in the JHS codebook. Describe the variables chosen and their rationale.

2.1 Data Management – The Process of Manuscript Development in the Jackson Heart Study

This lesson covers the process for requesting and obtaining Jackson Heart Study (JHS) data. Students will learn to navigate data access procedures, including submitting manuscript or ancillary study proposals, completing data use agreements, and addressing ethical considerations to ensure responsible use of JHS data.

Learning Objectives:

Explain Data Access Procedures – Describe the process for requesting and obtaining JHS data, including data use agreements and ethical considerations.

Assessment Instrument:

In your own words, enumerate the steps involved in the process for developing a JHS manuscript.
Using the information acquired from the lecture, draft a mock JHS manuscript proposal, using the Manuscript Proposal Form provided and the sample manuscript proposal.

2.2 Metadata – Data About Data: Maximizing data and code for broad uses

This lesson explores the essentials of metadata in biomedical research datasets, covering data collection methods, population, and context. Students will learn to identify high-quality metadata that supports reproducibility and distinguish it from inadequate metadata, ensuring robust and reliable research outcomes.

Learning Objectives:

Understand standard components of metadata on biomedical science research datasets: how the data were collected, on what population, under what circumstances, etc.
Learn to distinguish between good and bad metadata for reproducibility.

Assessment Instrument:

List 5 critical metadata categories that researchers need to know when reproducing findings; From the Diabetes and SES article you read by Jie Hu and PDSA codebook documents, what metadata are provided about how the Socioeconomic data were collected?

2.3 Data representation

This lesson examines how data can be represented in multiple ways, highlighting that each representation impacts task efficiency. Students will learn to select optimal data representations tailored to specific research tasks, balancing ease and complexity.

Learning Objectives:

Understand that the same data can be represented in many ways.
Appreciate that each representation choice makes some tasks easier, but others more difficult.
Learn how to choose a good representation for the task at hand.

Assessment Instrument:

Suppose you have collected data using the PHF form in https://github.com/biomathematicus/DAIR-3/blob/main/ESF/PHF.pdf, choose a design for a relational table. Your table will hold data from hundreds of such forms as they are filled in and returned. List the column names you will use for your table. State at least one drawback of your chosen design. Estimated time: 10 min

2.4 Data Sharing – Data Sharing 101

This lesson introduces the principles of Open Science, focusing on the NIH Data Management & Sharing Policy’s rationale and key components. Students will explore the FAIR Guiding Principles (Findable, Accessible, Interoperable, Reusable), learning their definitions and practical examples to promote transparent and reproducible research.

Learning Objectives:

Appreciate the foundations of Open Science-

NIH Data Management & Sharing Policy – Rationale and Key components
FAIR Guiding Principles – Definition and examples

Assessment Instrument:

Define rationale behind NIH Data Management & Sharing requirement; list 5 key components you should include in your 2-page Data Management Plan; list the 4 FAIR principles.

2.4 Data Sharing – The Reality

This lesson examines privacy and confidentiality concerns in Open Science and data sharing. Students will learn to differentiate between biomedical research types: bench science, human clinical trials, and animal models—and understand the unique data sharing implications for each, including ethical considerations and strategies to protect sensitive data while promoting transparency.

Learning Objectives:

Learn about privacy/confidentiality concerns related to Open Science and data sharing.
Articulate difference in types of biomedical research (bench science, human clinical trials, animal models) and what implications data sharing has for each.

Assessment Instrument:

For the 3 examples of biomedical research provided, learners will articulate the promises & pitfalls of data sharing for reproducibility.

Unit 3: Rigorous statistical design (5.5 hours)

3.1 Principles of study design for empirical research – Developing a study design for a research aim

A practical workshop exploring how to select appropriate study designs for health research aims. Students will analyze research scenarios, compare methodologies based on scientific rigor and practical implementation considerations, and develop skills to justify design choices while acknowledging their limitations.

Learning Objectives:

Using a research aim provided by a health researcher, the student should be able to discuss one or more study designs that rigorously address the stated research aim.
The student should be able to discuss the strengths and weaknesses of alternative study designs in terms of rigor, as well as in terms of practical aspects of implementation and execution.

Assessment Instrument:

Glomerular Filtration Rate (GFR) is a measure of kidney performance, it is the rate (in mL per minute) at which the kidneys filter the blood. Higher values of eGFR indicate better kidney performance. GFR declines with age, but there are differences between individuals in the rate and shape of the GFR trajectory. Low GFR is a risk factor for both heart failure and kidney failure. Given this context, propose a study design that would be able to characterize the diversity of GFR trajectories in a population of interest, and also quantify the extent to which low GFR associates with heart failure. The study design should include brief sections on

statement of the scientific research aim
definition of the target population
how subjects will be recruited
schedule of baseline and follow-up assessments
what variables will be collected at each assessment
anticipated challenges including missing data and loss to follow-up

3.2 Analytic plans and statistical power – Developing an analytic plan for a study design

This lesson equips students with the skills to develop rigorous analytical plans aligned with specific research aims and study designs. Participants will learn to conduct comprehensive power analyses to predict discoverable effect sizes and estimation precision, ensuring findings are reproducible in real-world settings. The course emphasizes selection of appropriate statistical methods that account for study design structure and maximize analytical validity.

Learning Objectives:

For a given research aim and study design, the student should be able to develop an analytic plan that could be used to obtain scientifically rigorous insight about the aim of interest.
The analytic plan should be accompanied by a power analysis that allows the researcher to anticipate what they will be able to discover, and what they are likely to miss when carrying out the research.
Power assessment should include discussion of the precision with which parameters of interest will be estimated, and the effect sizes that can confidently be detected.
The student should be able to justify that their assessment of statistical power reflects the real-world reproducibility of the analytic findings.
The student should be able to devise an analytic plan employing appropriate statistical methods that address the structure of the study design.

Assessment Instrument:

Based on the study design that you proposed in part 3.1, develop an analytic plan that can be used to provide quantitative evidence regarding the stated research aims. Then conduct a power analysis that supports your study design decisions.

3.3 Sources of bias and causal interpretation of research findings – Interpreting findings from empirical research

Students will learn to identify potential sources of bias, uncertainty, and non-reproducibility in observational cohort studies, and suggest basic strategies to address these issues.

Learning Objectives:

Given a statement of research aims, a cohort study, and analysis plan for an observational study, students should be able to identify specific risks for bias, uncertainty, and non-reproducibility due to the observational nature of this study.
In addition, students should be able to propose some elementary remedies for these challenges.

Assessment Instrument:

Produce realistic mock results from the study design and analytic plans developed in parts 3.1 and 3.2. For these mock results, interpret the findings in relation to the provided research aim. Be sure to consider your confidence in the results, and how aspects of the study design may influence bias and precision that in turn impact your conclusions.

Unit 4: Design and reporting of predictive models (6 hours)

This section focuses on aspects of data modeling, feature engineering, clustering and classification (as part of predictive modeling). Along with discussing a case study, we will review rubrics for standardized reporting of predictive models.

4.0 Pre-reading materials

This session focuses on ideas from basic statistics, and exploratory data analysis. It examines distributional checks, visualizations and data pre-processing.

Learning Objectives:

Students will be able to:

Understand ideas from basic statistics, like probability distributions and hypothesis testing
Understand ideas from basic exploratory data analysis techniques, including visualizations, hypothesis tests and missing data aspects
Understand TRIPOD guidelines pertinent to this session

Assessment Instrument:

N/A. Please read the paper: https://www.sciencedirect.com/org/science/article/pii/S2291969425000225

4.1 Data preparation

Students will learn key dimension reduction techniques like PCA and MDS, and understand how these methods impact feature selection and clustering/classification in data preprocessing.

Learning Objectives:

Students will be able to:

Understand dimension reduction approaches (like PCA, MDS)
Their consequence on pre-processing techniques like feature selection and clustering/classification
Understand TRIPOD guidelines for this subsection

Assessment Instrument: 15 mins

Describe the variables chosen and their rationale. Please populate items for this subsection of the TRIPOD-checklist.

4.2 Modeling tools

This lesson introduces the core concepts of rigorous statistical design in empirical research, focusing on observational studies. Students will learn to plan and analyze studies, understand sources of bias and uncertainty, and apply methods such as stratification, regression adjustment, matching, and propensity scores to improve validity and reproducibility.

Learning Objectives:

Students will be able to:

Understand principles of classification modeling and metrics for their principled evaluation
Understand Pitfalls and potential failure modes, including data leakage modes in predictive modeling
Understand TRIPOD-AI guidelines (extension from statistical models to AI models)

Assessment Instrument: 30 mins

Please populate items for this subsection of the TRIPOD-checklist.

4.3 A review of predictive modeling

Students will gain an understanding of the high throughput screening workflow through a case study and learn the key components and guidelines of the TRIPOD framework.

Learning Objectives:

Students will be able to:

Understand the overall workflow, along a demonstration case study from high throughput screening
Understand overall TRIPOD guidelines and ingredients

Assessment Instrument: 1 hour

Complete the TRIPOD-AI checklist for the paper: https://www.sciencedirect.com/org/science/article/pii/S2291969425000225. Discuss what was missing in the reporting within the paper. Answer questions around intuition-building for these techniques.

Unit 5: Reproducible Workflows (5.5 hours)

5.1 Goals of Reproducible Analyses

Learn the key goals and challenges of creating reproducible, transparent, and user-friendly analyses that are easy to share and reuse.

Learning Objectives:

Awareness of key challenges and goals when creating reproducible workflows, including making analyses reproducible user friendly, transparent, reusable, version controlled, and archived.

5.2 Reproducibility via Code Notebooks

Gain awareness of Markdown, Jupyter, and Quarto, and learn how these tools integrate to create clear, reproducible workflows for data analysis and reporting.

Learning Objectives:

Awareness of markdown, Jupyter, quarto, and how these tools can be integrated into reproducible workflows.

5.3 Best practices for Reproducible Programming

Learn essential best practices for reproducible programming, including writing clear scripts and functions, avoiding magic numbers, using caching and seeding for randomness, and refactoring code to enhance clarity, reliability, and repeatability.

Learning Objectives:

Awareness of best practices for reproducible programming including writing scripts, functions, avoiding magic numbers, caching and seeding randomness, and how to refactor code to align with these practices.

5.4 Version Control

Gain a basic understanding of Git, its advantages, and learn to perform essential tasks such as cloning repositories, committing changes, and syncing with remote repositories using push and pull commands.

Learning Objectives:

Familiarity with git and its benefits, and the ability to begin using it for simple tasks, including cloning, committing changes, pushing and pulling.

5.5 Containers

Gain hands-on experience with key dependency management tools-Python virtual environments, renv, and containerization-understanding their pros and cons, and develop the skills to create and run basic Docker images.

Learning Objectives:

Familiarity with various tools for dependency management, including python virtual environments, renv, and containerization, and their respective strengths and weaknesses. Ability to create and run simple docker images.

5.6 Assembling a full analysis pipeline

Learn key factors in organizing an analysis pipeline and develop the skills to assemble a complete, reusable pipeline template.

Learning Objectives:

Considerations when organizing an analysis pipeline, and the ability to assemble a full template pipeline.

Assessment Instrument:

Describe your progress on the template workflow. What aspects did you find most confusing or challenging? Which tools (e.g., Git, Make, Docker) were hardest to implement, and why?
What’s one thing you plan to change or do differently in your own projects after today’s session? Give a specific example of an analysis or workflow improvement you intend to make.

Unit 6: Meta-analysis (3.5 hours)

6.1 Key concepts in research synthesis and integration of statistical evidence

This lesson teaches students to understand the purpose of meta-analysis and the types of data required, as well as the different approaches depending on whether complete or summary data are used. Students will learn to distinguish between statistical uncertainty and variation, grasp the basic mathematics of pooling p-values, standard errors, and confidence intervals from independent sources, and understand how to weight estimates with varying precision to produce an optimal pooled estimate.

Learning Objectives:

The students will understand the purpose of meta-analysis, and what information is needed to conduct a meta-analysis. They will understand the differing roles of meta-analyses based on complete data versus summary data. In terms of foundational ideas, they will understand:

The difference between statistical uncertainty and variation
The basic mathematics behind pooling p-values, standard errors, and confidence intervals from independent sources
How to weight estimates with different precisions to produce an optimal pooled estimate

Assessment Instrument:

The assessment is “assessment 1” contained in the jupyter notebook at https://raw.githubusercontent.com/kshedden/workshops/refs/heads/main/dair3/meta/exercises/exercises%20without%20solutions.ipynb

6.2 Adjusting for heterogeneity

Students will learn to separate effect size variability into statistical noise and true heterogeneity, then apply stratification/regression methods to estimate robust consensus effects and significance levels across methodologically diverse studies.

Learning Objectives:

Students will understand how variation in measured effect sizes can be partitioned into the component attributable to statistical variation, and the component attributable to heterogeneity in the true effects.
They will also understand how stratification and regression can be used to estimate a consensus effect size and significance level from studies with heterogeneous designs.

Assessment Instrument:

The assessment is “assessment 2” contained in the jupyter notebook at https://raw.githubusercontent.com/kshedden/workshops/refs/heads/main/dair3/meta/exercises/exercises%20without%20solutions.ipynb

6.3 Accounting for non-independence and network effects

Students will explore how non-dependent research data skews meta-analyses, pinpoint common sources of dependency (e.g., shared authors or methods), and implement multilevel regression techniques to correct for these biases during synthesis.

Learning Objectives:

The students will understand how non-independence of research results impacts research synthesis, and will be able to identify possible sources of non-independence.
They will be able to employ multilevel regression to account for non-independence when conducting a meta-analysis.

Assessment Instrument:

The assessment is “assessment 3” contained in the juyter notebook at https://raw.githubusercontent.com/kshedden/workshops/refs/heads/main/dair3/meta/exercises/exercises%20without%20solutions.ipynb

Unit 7: Transformer-based AI in Biomedical Research (3 hours)

7.1 Theoretical Foundations – The basics of LLMs

This lesson introduces the theoretical foundations of transformer models, compares them with simple Artificial Neural Networks (ANNs), and demonstrates how transformers evolve into Large Language Models (LLMs). Participants will also learn to choose appropriate tools for specific tasks.

Learning Objectives:

Participants will gain a broad understanding of the theoretical aspects of transformer models.
They will be able to select a tool adequate for a task.
This session will present examples about simple Artificial Neural Networks (ANNs) and simple transformer models.
How transformers become Large Language Models (LLMs).

Assessment Instrument:

In your own words, explain: What are word embeddings? Comment on the difference the number of parameters in a LLM makes. What LLM would you choose for the development of an analytical pipeline among several options including Llama 7B, Llama 30B, GPT 4o, Claude 3.5, and Foundational models. Refer to the Reference Guide in the lesson’s PDF section. Estimated time: 15 min.

7.2 Transformers in Biomedical Research – Consensus in LLMs

Explore how consensus emerges in large language models (LLMs) by working hands-on with multiple systems like OpenAI’s GPT and Anthropic’s Claude. This course emphasizes practical skills in using application programming interfaces (APIs) to interact with these models, focusing on programmatic integration rather than web-based tools. Participants will learn to compare model behaviors, understand API features, and build applications that leverage LLM consensus for real-world tasks.

Learning Objectives:

Participants will gain understanding on how consensus works in LLMs using multiple systems, e.g. GPT, Anthropic’s Claude, etc. However, the emphasis will be on application programming interfaces as opposed to web-based apps.

Assessment Instrument:

Pre-workshop tasks included setting up the environment for local access to LLMs via APIs. Execute a consensus framework either manually between browsers or programmatically with the source code provided. Ask multiple LLMs about routes of analysis for the data of the JHS. Refer to step-by-step tutorial in the lesson’s PDF section. Estimated time: 30 min.

7.3 Data Management for Transformers – LLMs in Biomedical Research

Participants will learn to build a consensus analysis pipeline using LLMs by following a step-by-step template, which they can adapt and expand to complete their technical assignment.

Learning Objectives:

Participants will create a consensus analysis pipeline using LLMs to complete a technical task. A step-by-step template is provided. Participants will be able to expand this template to complete their assignment.

Assessment Instrument:

Create an LLM consensus analysis pipeline to:

Generate source code based on data files, for data ingestion, sanitization, quantitative analysis, and presentation of results using ABBA in Form 1 from the JSH, or
Prepare a proposal for the JHS using guidelines and a research idea via LLMs; participants will not write text directly.

The outcome is either a technical report that is produced as the result of executing the analysis pipeline, or the draft of a proposal for JHS. Refer to step-by-step tutorial in the lesson’s PDF section. Estimated time: 1 hour.