RELIANCE: Curating and Evaluating Reproductive Health Information on Social Media

Vaibhav Balloli¹, Laura Peyton Ellis², Vishala Mishra³, Alice Chi¹, Alex Peahl¹, Elizabeth Bondi-Kelly¹

¹University of Michigan · ²University of Connecticut School of Medicine · ³Duke University School of Medicine

Abstract

Social media platforms like TikTok have become a key source of health information, including for questions about the peripartum period (before, during, and after pregnancy). Inaccurate health information in this setting can have adverse consequences. As Large Language Model (LLM) providers increasingly integrate LLMs into digital platforms to fact-check content, these systems need to be evaluated on real-world multimodal videos.

We introduce RELIANCE, an expert-annotated dataset of reproductive health information surfaced by TikTok search. The dataset starts from 56 clinician-reviewed natural-language questions about the peripartum period, collects the top six TikTok video results for each question, and asks expert clinicians to identify medically relevant sentences or paragraphs in the transcripts.

RELIANCE contains 409 annotated sentences or paragraphs from 336 videos. Clinicians label each one for inaccuracy and harmfulness, making it possible to separate information that is not supported by scientific evidence or standard clinical practice from medically dangerous information that can cause adverse consequences. We use the same annotations to evaluate whether LLMs can detect inaccurate and harmful information.
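To make the two-label design concrete, the following is a minimal sketch of how one annotated sentence or paragraph might be represented, assuming a simple record with independent inaccuracy and harmfulness flags. The class name, field names, and toy records are hypothetical and are not taken from the dataset release.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Annotation:
    """One clinician-highlighted sentence or paragraph (hypothetical schema)."""
    video_id: str     # assumed identifier, not from the dataset release
    text: str         # highlighted transcript span
    inaccurate: bool  # not supported by evidence or standard clinical practice
    harmful: bool     # medically dangerous, can cause adverse consequences


def label_fractions(annotations: List[Annotation]) -> Dict[Tuple[bool, bool], float]:
    """Return the share of annotations in each (inaccurate, harmful) cell."""
    if not annotations:
        raise ValueError("need at least one annotation")
    counts = {(i, h): 0 for i in (False, True) for h in (False, True)}
    for a in annotations:
        counts[(a.inaccurate, a.harmful)] += 1
    return {cell: n / len(annotations) for cell, n in counts.items()}


# Toy placeholder records for illustration only; not real RELIANCE data.
toy = [
    Annotation("v1", "claim A", inaccurate=False, harmful=False),
    Annotation("v1", "claim B", inaccurate=True, harmful=False),
    Annotation("v2", "claim C", inaccurate=True, harmful=True),
    Annotation("v3", "claim D", inaccurate=False, harmful=False),
]
fractions = label_fractions(toy)
```

Keeping the two flags as separate fields, rather than a single severity scale, lets an analysis count inaccurate-but-harmless claims separately from inaccurate-and-harmful ones.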

Dataset

Pie chart showing RELIANCE query distribution by pregnancy stage. — Each search question was assigned to a stage in the peripartum period so the dataset could be summarized by stage.

Pie chart showing RELIANCE query distribution by reproductive health category. — Clinicians also assigned each search question to a reproductive health category so the dataset could be summarized by query topic.

Correlation heatmap comparing accuracy and harmfulness labels with engagement metrics. — The heatmap compares clinician labels with views, likes, shares, comments, and creator followers; popularity was not a reliable signal of medical accuracy or harm.

What Search Returned

Videos found via TikTok search were largely relevant to the clinician-reviewed questions, and the collected videos averaged roughly 273K views.

What Clinicians Labeled

Clinicians labeled 69.9% of annotated sentences or paragraphs as accurate and 16.2% as harmful. Because inaccurate information can be either harmless or harmful, the dataset keeps the two labels separate.

Popularity Is Not Reliability

Views, likes, shares, comments, and creator followers do not reliably indicate whether a video contains accurate reproductive health information.
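One way to check a claim like this is to correlate a binary clinician label with each engagement metric. The sketch below assumes a plain Pearson correlation (equivalent to a point-biserial correlation when one variable is binary); the source does not state which correlation the heatmap uses, and the function name and toy numbers are illustrative only.

```python
import math
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation; returns NaN if either input has zero variance."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("inputs must have equal length of at least 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return float("nan")
    return cov / math.sqrt(vx * vy)


# Toy values for illustration only; not real RELIANCE data.
inaccurate = [0, 1, 0, 1]     # 1 = clinician labeled the claim inaccurate
views = [100, 100, 300, 300]  # hypothetical view counts
r = pearson(inaccurate, views)
```

In this toy example the inaccurate and accurate items have the same mean view count, so `r` is zero. A value near zero on real data would be consistent with popularity carrying little signal about accuracy.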

LLM Evaluation

RELIANCE evaluates general-purpose LLMs at two granularities. In claim-level verification, the model checks a specific clinician-highlighted sentence or paragraph from a video. In video-level detection, the model receives the query and the complete transcript and must decide whether the video contains inaccurate or harmful reproductive health information.
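The relationship between the two granularities can be sketched as follows. This assumes that a video counts as inaccurate (or harmful) if at least one of its annotated claims carries that label; the source does not specify how video-level reference labels are derived, so this aggregation rule, the function name, and the toy rows are assumptions.

```python
from typing import Dict, Iterable, Tuple


def video_level_labels(
    claim_labels: Iterable[Tuple[str, bool, bool]]
) -> Dict[str, Tuple[bool, bool]]:
    """Collapse (video_id, inaccurate, harmful) claim rows into per-video labels.

    Assumed rule: a video "contains" a label if any of its claims has it.
    """
    videos: Dict[str, Tuple[bool, bool]] = {}
    for video_id, inaccurate, harmful in claim_labels:
        prev_inaccurate, prev_harmful = videos.get(video_id, (False, False))
        videos[video_id] = (prev_inaccurate or inaccurate, prev_harmful or harmful)
    return videos


# Toy claim rows for illustration only; not real RELIANCE data.
claims = [
    ("v1", False, False),
    ("v1", True, False),
    ("v2", False, True),
]
per_video = video_level_labels(claims)
```

Under this rule, a single problematic claim inside an otherwise accurate transcript is enough to flag the whole video, which suggests why video-level detection can be harder than checking one highlighted claim.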

Gemini-2.5-Flash performs best overall in these evaluations, while most models are better at identifying harmful information than identifying inaccuracies. Performance drops when moving from specific claim verification to whole-video judgment, with a consistent gap of about 15% across tasks.

RELIANCE also tests safeguard models that flag user questions, video transcripts, and LLM responses for violations of their internal guidelines. These systems behave inconsistently on reproductive health content. The omni-moderation model flagged the query "How much kicking should a pregnant woman feel?" as violence, while Llama Guard often flagged responses as specialized advice.

Line charts showing LLM performance on detecting inaccurate reproductive health information. — Model performance on detecting inaccurate information, measured with accuracy, precision, recall, and calibration metrics.
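For reference, the metrics named in this caption can be computed as in the sketch below. It uses the standard definitions of accuracy, precision, and recall for a binary task, with the Brier score as one example of a calibration-related metric; the function names, the zero-division convention, and the toy inputs are assumptions, not the evaluation code used for RELIANCE.

```python
from typing import Dict, Sequence


def binary_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Accuracy, precision, and recall with label 1 as the positive class."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        # Convention assumed here: report 0.0 when the denominator is empty.
        "precision": tp / (tp + fp) if (tp + fp) else 0.0,
        "recall": tp / (tp + fn) if (tp + fn) else 0.0,
    }


def brier_score(y_true: Sequence[int], probs: Sequence[float]) -> float:
    """Mean squared error between predicted probabilities and 0/1 labels."""
    if len(y_true) != len(probs) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - t) ** 2 for t, p in zip(y_true, probs)) / len(y_true)


# Toy inputs for illustration only; 1 = inaccurate, 0 = accurate.
y_true = [1, 0, 1, 0]
y_pred = [1, 1, 0, 0]
probs = [1.0, 0.0, 0.5, 0.5]
metrics = binary_metrics(y_true, y_pred)
brier = brier_score(y_true, probs)
```

A lower Brier score indicates predicted probabilities that sit closer to the true labels, so it complements the threshold-based metrics above.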

Line charts showing LLM performance on detecting harmful reproductive health information. — Model performance on detecting harmful information; harmfulness is evaluated separately from factual accuracy.

Hamming distance heatmap comparing LLM predictions for the inaccuracy detection task. — Hamming distance shows how often pairs of models disagree on the same inaccuracy labels, which helps identify whether an ensemble would add complementary judgments.
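The pairwise comparison behind such a heatmap can be sketched as below, assuming each model produces one binary prediction per item and the distance is normalized by the number of items. The source does not state whether the heatmap uses raw or normalized distances, and the model names and predictions here are hypothetical.

```python
from itertools import combinations
from typing import Dict, Sequence, Tuple


def hamming(a: Sequence[int], b: Sequence[int]) -> float:
    """Fraction of positions where two prediction vectors disagree."""
    if len(a) != len(b) or not a:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def pairwise_hamming(
    preds: Dict[str, Sequence[int]]
) -> Dict[Tuple[str, str], float]:
    """Normalized Hamming distance for every unordered pair of models."""
    return {
        (m1, m2): hamming(preds[m1], preds[m2])
        for m1, m2 in combinations(sorted(preds), 2)
    }


# Hypothetical models and predictions for illustration only.
preds = {
    "model_a": [1, 0, 1, 0],
    "model_b": [1, 1, 1, 0],
    "model_c": [0, 1, 0, 1],
}
distances = pairwise_hamming(preds)
```

Pairs with a large distance disagree on many items, which is the situation in which combining models could add complementary judgments.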

Qwen model scaling panels for accuracy, precision, recall, calibration, and Brier score. — Comparing Qwen models of different sizes shows that larger models in the same family did not consistently improve reproductive health fact-checking performance.

Takeaways

Real-World Data Matters

Medical exam and question-answering benchmarks do not fully capture real-world multimodal data from TikTok videos.

Claim Granularity Matters

Models perform differently when checking one highlighted sentence or paragraph versus judging the full transcript of a video.

Safety Needs Domain Context

Safeguard models flagged different reproductive health queries and contexts under different violation categories, which shows the need for more consistent safeguards in this domain.