Citation Network Academic Review Workflow for LangGraph
Academic literature reviews require comprehensive discovery, quality filtering, thematic organization, and structured synthesis. This pattern implements a five-phase LangGraph pipeline that generates PhD-equivalent reviews (10,000 to 25,000 words, 50 to 300+ sources) through citation network expansion, two-stage relevance filtering, and dual-strategy clustering.
The Problem
Traditional literature review approaches face several challenges:
- Incomplete Discovery: Keyword search misses foundational works that aren’t indexed with current terminology
- Quality Noise: Raw search results include marginally relevant papers
- Organizational Difficulty: Large corpora need thematic structure for coherent synthesis
- Scalability: Review depth should match available time and budget
The Solution
A five-phase pipeline addresses each challenge: discovery, diffusion (filtered citation expansion), full-text processing, thematic clustering, and synthesis.
Quality Presets for Scalable Depth
Quality presets control workflow parameters, making it easy to trade thoroughness for speed:
QUALITY_PRESETS = {
    "quick": {
        "max_stages": 2,
        "max_papers": 50,
        "target_words": 8000,
        "saturation": 0.15,
    },
    "standard": {
        "max_stages": 3,
        "max_papers": 100,
        "target_words": 12000,
        "saturation": 0.12,
    },
    "comprehensive": {
        "max_stages": 4,
        "max_papers": 200,
        "target_words": 17500,
        "saturation": 0.10,
    },
    "high_quality": {
        "max_stages": 5,
        "max_papers": 300,
        "target_words": 25000,
        "saturation": 0.10,
    },
}

DEFAULT_SETTINGS = {
    "use_batch_api": True,
    "recency_years": 3,
    "recency_quota": 0.25,
}
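How a run consumes these tables isn't shown above; a minimal sketch of one plausible approach, where resolve_settings and should_continue_diffusion are illustrative helpers (not part of the workflow code) that merge a preset over the defaults and apply the saturation threshold to gate further diffusion stages:

def resolve_settings(preset: str = "standard") -> dict:
    """Merge a named preset over the shared defaults."""
    return {**DEFAULT_SETTINGS, **QUALITY_PRESETS[preset]}

def should_continue_diffusion(
    stage: int, new_relevant: int, candidates_seen: int, settings: dict
) -> bool:
    """Stop expanding when stages are exhausted or discovery saturates."""
    if stage >= settings["max_stages"] or candidates_seen == 0:
        return False
    # Saturation: the fraction of new relevant papers per candidate has
    # fallen below the preset threshold (e.g. 0.12 for "standard").
    return (new_relevant / candidates_seen) >= settings["saturation"]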
Phase 1: Discovery
LLM-generated OpenAlex queries combined with forward/backward citation seeding find both recent publications and foundational works:
async def discovery_phase_node(state: AcademicLitReviewState) -> dict[str, Any]:
    """Discover seed papers through multiple strategies."""
    # 1. Keyword search via OpenAlex
    # 2. Forward/backward citation seeding from highly cited papers
    # 3. Expert author identification
    # Each strategy yields the DOI lists returned below; together they
    # form the initial corpus.
    return {
        "keyword_papers": keyword_dois,
        "citation_papers": citation_dois,
        "expert_papers": expert_dois,
        "paper_corpus": initial_corpus,
        "current_phase": "diffusion",
    }
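For illustration, the keyword-search step could call the OpenAlex REST API directly. This is a hypothetical helper, assuming the public https://api.openalex.org/works endpoint, not the workflow's actual implementation:

import httpx

async def search_openalex_keywords(query: str, per_page: int = 25) -> list[str]:
    """Sketch: fetch DOIs for a keyword query from OpenAlex."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.openalex.org/works",
            params={"search": query, "per-page": per_page},
        )
        resp.raise_for_status()
    # OpenAlex returns DOIs as full URLs; some records have none.
    return [work["doi"] for work in resp.json()["results"] if work.get("doi")]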
Phase 2: Diffusion with Two-Stage Filtering
Recursive citation expansion discovers related papers. Two-stage filtering reduces LLM scoring costs:
async def enrich_and_score_candidates(
    candidates: list[dict],
    citation_graph: CitationGraph,
    corpus_dois: set[str],
    topic: str,
    research_questions: list[str],
) -> tuple[list[str], list[tuple[str, float]], list[str]]:
    """
    Stage 1: Enrich with co-citation counts (cheap, fast).
    Stage 2: LLM scoring with co-citation context (expensive, targeted).

    Returns (relevant, fallback, rejected) where:
    - relevant: score >= 0.6
    - fallback: score 0.5-0.6 (substitutes when acquisition fails)
    - rejected: score < 0.5
    """
    # Stage 1: Co-citation enrichment (candidates without a DOI are dropped)
    enriched = []
    for candidate in candidates:
        doi = candidate.get("doi")
        if doi:
            cocitations = citation_graph.get_corpus_overlap(doi, corpus_dois)
            enriched.append({**candidate, "corpus_cocitations": cocitations})

    # Stage 2: LLM scoring partitions papers by relevance threshold
    relevant, fallback, rejected = await batch_score_relevance(
        papers=enriched,
        topic=topic,
        research_questions=research_questions,
        threshold=0.6,
        fallback_threshold=0.5,
    )
    return (
        [p["doi"] for p in relevant],
        [(p["doi"], p["relevance_score"]) for p in fallback],
        [p["doi"] for p in rejected],
    )

The co-citation pre-filter reduces LLM calls by approximately 50 percent. The fallback queue (papers scoring 0.5 to 0.6) provides substitutes when paper acquisition fails, maintaining corpus size.
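The pre-filter itself is not shown in the excerpt above; one plausible version, assuming a minimum co-citation cutoff, would drop weakly connected candidates between Stage 1 and Stage 2:

def cocitation_prefilter(
    enriched: list[dict], min_cocitations: int = 1
) -> list[dict]:
    """Keep only candidates already linked to the corpus.

    Illustrative sketch: candidates with no co-citations into the current
    corpus rarely score well, so skipping them avoids the LLM calls that
    account for the roughly 50 percent reduction cited above.
    """
    return [
        c for c in enriched
        if c.get("corpus_cocitations", 0) >= min_cocitations
    ]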
Citation Graph with NetworkX
A simplified graph leverages NetworkX’s built-in capabilities:
from typing import Any

import networkx as nx

class CitationGraph:
    """Manages citation relationships using NetworkX."""

    def __init__(self) -> None:
        self._graph: nx.DiGraph = nx.DiGraph()

    def add_paper(self, doi: str, **metadata: Any) -> None:
        """Add or update a paper node with metadata."""
        self._graph.add_node(doi, **metadata)

    def add_citation(self, citing_doi: str, cited_doi: str) -> bool:
        """Add citation edge. Returns False if nodes missing or edge exists."""
        if citing_doi not in self._graph or cited_doi not in self._graph:
            return False
        if self._graph.has_edge(citing_doi, cited_doi):
            return False
        self._graph.add_edge(citing_doi, cited_doi)
        return True

    def get_corpus_overlap(self, doi: str, corpus_dois: set[str]) -> int:
        """Count papers in corpus that cite or are cited by this paper."""
        if doi not in self._graph:
            return 0
        neighbors = (
            set(self._graph.successors(doi))
            | set(self._graph.predecessors(doi))
        )
        return len(neighbors & corpus_dois)
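A quick usage check, derived directly from the class above (DOIs abbreviated):

graph = CitationGraph()
graph.add_paper("10.1000/a", title="Paper A")
graph.add_paper("10.1000/b", title="Paper B")
graph.add_paper("10.1000/c", title="Paper C")
graph.add_citation("10.1000/a", "10.1000/b")  # A cites B
graph.add_citation("10.1000/c", "10.1000/a")  # C cites A

# Paper A touches both corpus members (cites B, cited by C), so overlap is 2.
assert graph.get_corpus_overlap("10.1000/a", {"10.1000/b", "10.1000/c"}) == 2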
Phase 4: Dual-Strategy Clustering
BERTopic provides statistical clustering; an LLM provides semantic clustering. Claude Opus synthesizes both:
async def synthesize_clusters(
    bertopic_clusters: list[dict] | None,
    llm_schema: dict | None,
    paper_count: int,
) -> list[ThematicCluster]:
    """
    Synthesize final clusters from BERTopic and LLM approaches.

    Priority:
    1. Both available + BERTopic quality good: Opus synthesizes
    2. BERTopic failed/poor: Use LLM schema directly
    3. LLM failed: Use BERTopic directly
    """
    bertopic_good = _evaluate_bertopic_quality(bertopic_clusters, paper_count)
    if not bertopic_clusters or not bertopic_good:
        return _clusters_from_llm(llm_schema) if llm_schema else []
    if not llm_schema:
        return _clusters_from_bertopic(bertopic_clusters)
    # Both available - Opus synthesizes
    return await _opus_synthesize(bertopic_clusters, llm_schema)

def _evaluate_bertopic_quality(clusters: list[dict] | None, paper_count: int) -> bool:
    """BERTopic struggles with small corpora (<30 papers)."""
    if not clusters or paper_count < 30:
        return False
    non_outlier = [c for c in clusters if c.get("cluster_id", -1) >= 0]
    return len(non_outlier) >= 2
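The statistical half could come from BERTopic's standard fit/transform API. A minimal sketch, assuming abstracts are available as plain strings; the {"cluster_id", "size"} shape matches what _evaluate_bertopic_quality reads:

from collections import Counter

from bertopic import BERTopic

def run_bertopic(abstracts: list[str]) -> list[dict]:
    """Cluster abstracts; BERTopic labels outlier documents with topic -1."""
    topic_model = BERTopic()
    topics, _probs = topic_model.fit_transform(abstracts)
    return [
        {"cluster_id": topic_id, "size": size}
        for topic_id, size in Counter(topics).items()
    ]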
LangGraph State with Reducers
State fields that receive parallel writes use annotated reducers:
from operator import add
from typing import Annotated, TypedDict

def merge_dicts(existing: dict | None, new: dict | None) -> dict:
    """Reducer for dict fields - handles None gracefully."""
    result = dict(existing) if existing else {}
    if new:
        result.update(new)
    return result

class AcademicLitReviewState(TypedDict):
    # Parallel aggregation via `add` reducer
    keyword_papers: Annotated[list[str], add]
    citation_papers: Annotated[list[str], add]
    errors: Annotated[list[dict], add]
    # Dict merging via custom reducer (PaperMetadata is defined elsewhere)
    paper_corpus: Annotated[dict[str, PaperMetadata], merge_dicts]
    section_drafts: Annotated[dict[str, str], merge_dicts]
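To see what the reducers do with parallel writes, the functions can be called directly; LangGraph applies them the same way when two nodes update the same field:

# `add` concatenates parallel list writes:
merged_lists = add(["10.1000/a"], ["10.1000/b"])  # ["10.1000/a", "10.1000/b"]

# `merge_dicts` unions parallel dict writes, tolerating None:
merged_corpus = merge_dicts(
    {"10.1000/a": {"title": "Paper A"}},
    {"10.1000/b": {"title": "Paper B"}},
)  # both entries survive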
Graph Construction
from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph

def create_academic_lit_review_graph() -> CompiledStateGraph:
    """Create the five-phase literature review workflow."""
    builder: StateGraph = StateGraph(AcademicLitReviewState)
    builder.add_node("discovery", discovery_phase_node)
    builder.add_node("diffusion", diffusion_phase_node)
    builder.add_node("processing", processing_phase_node)
    builder.add_node("clustering", clustering_phase_node)
    builder.add_node("synthesis", synthesis_phase_node)

    builder.add_edge(START, "discovery")
    builder.add_edge("discovery", "diffusion")
    builder.add_edge("diffusion", "processing")
    builder.add_edge("processing", "clustering")
    builder.add_edge("clustering", "synthesis")
    builder.add_edge("synthesis", END)
    return builder.compile()
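Invoking the compiled graph follows standard LangGraph usage. A minimal sketch, assuming the reducer fields above are the relevant initial state (the full schema also carries the topic, research questions, and quality preset, which are not shown here):

import asyncio

async def main() -> None:
    graph = create_academic_lit_review_graph()
    final_state = await graph.ainvoke({
        "keyword_papers": [],
        "citation_papers": [],
        "errors": [],
        "paper_corpus": {},
        "section_drafts": {},
    })
    # One drafted section per thematic cluster, plus framing sections.
    print(sorted(final_state["section_drafts"]))

asyncio.run(main())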
Workflow Structure
flowchart TD
    Start([START])
    P1["Phase 1: DISCOVERY<br/>LLM-generated OpenAlex queries<br/>Forward/backward citation seeding<br/>Expert author identification"]
    P2["Phase 2: DIFFUSION<br/>Recursive citation expansion (2-5 stages)<br/>Two-stage filtering: co-citation to LLM scoring<br/>Saturation detection"]
    P3["Phase 3: PROCESSING<br/>Full-text acquisition<br/>Document processing (Marker PDF extraction)<br/>PaperSummary extraction"]
    P4["Phase 4: CLUSTERING<br/>BERTopic statistical clustering<br/>LLM semantic clustering<br/>Opus synthesis (merges both approaches)"]
    P5["Phase 5: SYNTHESIS<br/>Parallel thematic section drafting<br/>Intro/methodology/discussion/conclusions<br/>PRISMA documentation generation"]
    End([END])

    Start --> P1 --> P2 --> P3 --> P4 --> P5 --> End
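One plausible shape for Phase 5's parallel section drafting uses asyncio.gather, with the merge_dicts reducer absorbing the per-cluster results; draft_section and the clusters state field are hypothetical names, not part of the code above:

import asyncio

async def synthesis_phase_node(state: AcademicLitReviewState) -> dict[str, Any]:
    """Sketch: draft one section per thematic cluster, concurrently."""
    clusters = state["clusters"]  # hypothetical field of ThematicCluster items
    drafts = await asyncio.gather(
        *(draft_section(cluster, state["paper_corpus"]) for cluster in clusters)
    )
    # merge_dicts combines this write with any other section drafts.
    return {"section_drafts": {c.name: d for c, d in zip(clusters, drafts)}}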
Trade-offs
Benefits:
- Comprehensive discovery via citation network expansion
- Quality-scalable through presets (50 to 300 papers)
- Two-stage filtering reduces LLM costs by approximately 50 percent
- Dual clustering produces robust thematic organization
- PRISMA documentation satisfies academic publication requirements
Costs:
- Processing time: two to four hours for high-quality reviews
- API costs: full-text processing plus clustering plus synthesis
- BERTopic sensitivity: small corpora produce poor statistical clusters
When to Use This Pattern
Good fit:
- Systematic literature reviews for academic or professional research
- Citation network analysis adds value (papers reference each other)
- Output quality must scale with time/budget
- Thematic clustering needed for large paper corpora
Poor fit:
- Simple keyword search suffices
- Corpus is already curated
- Output is informal summaries
- Time constraints prohibit multi-hour processing