Citation Network Academic Review Workflow for LangGraph

Academic literature reviews require comprehensive discovery, quality filtering, thematic organization, and structured synthesis. This pattern implements a five-phase LangGraph pipeline that generates PhD-equivalent reviews (8,000 to 25,000 words and 50 to 300+ sources, depending on preset) through citation network expansion, two-stage relevance filtering, and dual-strategy clustering.

The Problem

Traditional literature review approaches face several challenges:

  1. Incomplete Discovery: Keyword search misses foundational works that aren’t indexed with current terminology
  2. Quality Noise: Raw search results include marginally relevant papers
  3. Organizational Difficulty: Large corpora need thematic structure for coherent synthesis
  4. Scalability: Review depth should match available time and budget

The Solution

A five-phase pipeline that addresses each challenge:

Quality Presets for Scalable Depth

Quality presets control workflow parameters, making it easy to trade thoroughness for speed:

QUALITY_PRESETS = {
    "quick": {
        "max_stages": 2,
        "max_papers": 50,
        "target_words": 8000,
        "saturation": 0.15,
    },
    "standard": {
        "max_stages": 3,
        "max_papers": 100,
        "target_words": 12000,
        "saturation": 0.12,
    },
    "comprehensive": {
        "max_stages": 4,
        "max_papers": 200,
        "target_words": 17500,
        "saturation": 0.10,
    },
    "high_quality": {
        "max_stages": 5,
        "max_papers": 300,
        "target_words": 25000,
        "saturation": 0.10,
    },
}
 
DEFAULT_SETTINGS = {
    "use_batch_api": True,
    "recency_years": 3,
    "recency_quota": 0.25,
}
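
A minimal sketch of how a run might resolve its effective settings, assuming the workflow accepts a preset name at invocation time (the resolve_settings helper is illustrative, not part of the workflow's public API):

def resolve_settings(preset: str) -> dict:
    """Merge global defaults with a quality preset (illustrative helper)."""
    if preset not in QUALITY_PRESETS:
        raise ValueError(f"Unknown preset: {preset}")
    return {**DEFAULT_SETTINGS, **QUALITY_PRESETS[preset]}

settings = resolve_settings("comprehensive")
# => max_stages=4, max_papers=200, target_words=17500, saturation=0.10,
#    plus use_batch_api, recency_years, recency_quota from DEFAULT_SETTINGS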

Phase 1: Discovery

LLM-generated OpenAlex queries combined with forward/backward citation seeding find both recent publications and foundational works:

async def discovery_phase_node(state: AcademicLitReviewState) -> dict[str, Any]:
    """Discover seed papers through three complementary strategies."""
    topic = state["topic"]

    # 1. Keyword search via OpenAlex (helper names below are illustrative)
    keyword_dois = await search_openalex_keywords(topic)
    # 2. Forward/backward citation seeding from highly cited papers
    citation_dois = await seed_from_citations(keyword_dois)
    # 3. Expert author identification
    expert_dois = await find_expert_author_papers(topic)

    initial_corpus = await fetch_metadata(keyword_dois + citation_dois + expert_dois)
    return {
        "keyword_papers": keyword_dois,
        "citation_papers": citation_dois,
        "expert_papers": expert_dois,
        "paper_corpus": initial_corpus,
        "current_phase": "diffusion",
    }
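
For reference, the keyword-search helper might query the public OpenAlex API roughly like this sketch (query construction is simplified; the real node generates queries with an LLM first):

import httpx

async def search_openalex_keywords(query: str, per_page: int = 25) -> list[str]:
    """Fetch DOIs for a keyword query from OpenAlex (simplified sketch)."""
    async with httpx.AsyncClient() as client:
        resp = await client.get(
            "https://api.openalex.org/works",
            params={"search": query, "per-page": per_page},
        )
        resp.raise_for_status()
        works = resp.json()["results"]
    # Not every OpenAlex work carries a DOI; keep only those that do
    return [w["doi"] for w in works if w.get("doi")]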

Phase 2: Diffusion with Two-Stage Filtering

Recursive citation expansion discovers related papers. Two-stage filtering reduces LLM scoring costs:

async def enrich_and_score_candidates(
    candidates: list[dict],
    citation_graph: CitationGraph,
    corpus_dois: set[str],
    topic: str,
    research_questions: list[str],
) -> tuple[list[str], list[tuple[str, float]], list[str]]:
    """
    Stage 1: Enrich with co-citation counts (cheap, fast)
    Stage 2: LLM scoring with co-citation context (expensive, targeted)
 
    Returns (relevant, fallback, rejected) where:
    - relevant: score >= 0.6
    - fallback: score 0.5-0.6 (substitutes when acquisition fails)
    - rejected: score < 0.5
    """
    # Stage 1: Co-citation enrichment
    enriched = []
    for candidate in candidates:
        doi = candidate.get("doi")
        if doi:
            cocitations = citation_graph.get_corpus_overlap(doi, corpus_dois)
            enriched.append({**candidate, "corpus_cocitations": cocitations})
 
    # Stage 2: LLM scoring partitions papers by relevance threshold
    relevant, fallback, rejected = await batch_score_relevance(
        papers=enriched,
        topic=topic,
        research_questions=research_questions,
        threshold=0.6,
        fallback_threshold=0.5,
    )
 
    return (
        [p["doi"] for p in relevant],
        [(p["doi"], p["relevance_score"]) for p in fallback],
        [p["doi"] for p in rejected],
    )

The co-citation pre-filter reduces LLM calls by approximately 50 percent. The fallback queue (papers scoring 0.5 to 0.6) provides substitutes when paper acquisition fails, maintaining corpus size.
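A sketch of how that fallback queue might be consumed, assuming acquisition happens in Phase 3 (the helper name is illustrative):

def next_substitute(fallback: list[tuple[str, float]]) -> str | None:
    """Pop the highest-scoring fallback DOI when a relevant paper cannot be acquired."""
    if not fallback:
        return None
    fallback.sort(key=lambda pair: pair[1])  # sort ascending by relevance score
    doi, _score = fallback.pop()             # take the best remaining candidate
    return doi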

Citation Graph with NetworkX

A simplified graph leverages NetworkX’s built-in capabilities:

import networkx as nx


class CitationGraph:
    """Manages citation relationships using NetworkX."""
 
    def __init__(self) -> None:
        self._graph: nx.DiGraph = nx.DiGraph()
 
    def add_paper(self, doi: str, **metadata: Any) -> None:
        """Add or update a paper node with metadata."""
        self._graph.add_node(doi, **metadata)
 
    def add_citation(self, citing_doi: str, cited_doi: str) -> bool:
        """Add citation edge. Returns False if nodes missing or edge exists."""
        if citing_doi not in self._graph or cited_doi not in self._graph:
            return False
        if self._graph.has_edge(citing_doi, cited_doi):
            return False
        self._graph.add_edge(citing_doi, cited_doi)
        return True
 
    def get_corpus_overlap(self, doi: str, corpus_dois: set[str]) -> int:
        """Count papers in corpus that cite or are cited by this paper."""
        if doi not in self._graph:
            return 0
        neighbors = (
            set(self._graph.successors(doi)) |
            set(self._graph.predecessors(doi))
        )
        return len(neighbors & corpus_dois)
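
Usage is straightforward; a tiny worked example:

graph = CitationGraph()
graph.add_paper("10.1/a", title="Paper A")
graph.add_paper("10.1/b", title="Paper B")
graph.add_paper("10.1/c", title="Paper C")
graph.add_citation("10.1/a", "10.1/b")  # A cites B
graph.add_citation("10.1/c", "10.1/a")  # C cites A

# A's neighbors are B (cited) and C (citing); both are in the corpus
print(graph.get_corpus_overlap("10.1/a", {"10.1/b", "10.1/c"}))  # 2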

Phase 4: Dual-Strategy Clustering

BERTopic provides statistical clustering; an LLM provides semantic clustering. Claude Opus synthesizes both:

async def synthesize_clusters(
    bertopic_clusters: list[dict] | None,
    llm_schema: dict | None,
    paper_count: int,
) -> list[ThematicCluster]:
    """
    Synthesize final clusters from BERTopic and LLM approaches.
 
    Priority:
    1. Both available + BERTopic quality good: Opus synthesizes
    2. BERTopic failed/poor: Use LLM schema directly
    3. LLM failed: Use BERTopic directly
    """
    bertopic_good = _evaluate_bertopic_quality(bertopic_clusters, paper_count)
 
    if not bertopic_clusters or not bertopic_good:
        return _clusters_from_llm(llm_schema) if llm_schema else []
 
    if not llm_schema:
        return _clusters_from_bertopic(bertopic_clusters)
 
    # Both available - Opus synthesizes
    return await _opus_synthesize(bertopic_clusters, llm_schema)
 
 
def _evaluate_bertopic_quality(clusters: list[dict] | None, paper_count: int) -> bool:
    """BERTopic struggles with small corpora (<30 papers)."""
    if not clusters or paper_count < 30:
        return False
    non_outlier = [c for c in clusters if c.get("cluster_id", -1) >= 0]
    return len(non_outlier) >= 2
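
The statistical side might call the BERTopic library roughly as follows (the parameters shown are a plausible configuration, not the workflow's exact settings):

from bertopic import BERTopic

def bertopic_cluster(abstracts: list[str]) -> list[dict]:
    """Cluster paper abstracts statistically (sketch)."""
    model = BERTopic(min_topic_size=5)
    topics, _probs = model.fit_transform(abstracts)
    info = model.get_topic_info()  # one row per topic; Topic == -1 is the outlier bucket
    return [
        {"cluster_id": row["Topic"], "label": row["Name"], "size": row["Count"]}
        for _, row in info.iterrows()
    ]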

LangGraph State with Reducers

State fields that receive parallel writes use annotated reducers:

from operator import add
from typing import Annotated, TypedDict
 
def merge_dicts(existing: dict | None, new: dict | None) -> dict:
    """Reducer for dict fields - handles None gracefully."""
    result = dict(existing) if existing else {}
    if new:
        result.update(new)
    return result
 
class AcademicLitReviewState(TypedDict):
    # Parallel aggregation via `add` reducer
    keyword_papers: Annotated[list[str], add]
    citation_papers: Annotated[list[str], add]
    errors: Annotated[list[dict], add]
 
    # Dict merging via custom reducer (PaperMetadata is the project's metadata model)
    paper_corpus: Annotated[dict[str, PaperMetadata], merge_dicts]
    section_drafts: Annotated[dict[str, str], merge_dicts]
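
Reducers matter because parallel nodes in the same superstep write to the same key; a minimal self-contained demonstration:

from operator import add
from typing import Annotated, TypedDict

from langgraph.graph import END, START, StateGraph

class DemoState(TypedDict):
    items: Annotated[list[str], add]

def node_a(state: DemoState) -> dict:
    return {"items": ["from_a"]}

def node_b(state: DemoState) -> dict:
    return {"items": ["from_b"]}

builder = StateGraph(DemoState)
builder.add_node("a", node_a)
builder.add_node("b", node_b)
builder.add_edge(START, "a")   # a and b fan out from START and run in parallel
builder.add_edge(START, "b")
builder.add_edge("a", END)
builder.add_edge("b", END)

result = builder.compile().invoke({"items": []})
print(result["items"])  # both writes survive, e.g. ['from_a', 'from_b']

Without the reducer, the two concurrent writes to items would conflict; with it, LangGraph concatenates them.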

Graph Construction

from langgraph.graph import END, START, StateGraph
from langgraph.graph.state import CompiledStateGraph


def create_academic_lit_review_graph() -> CompiledStateGraph:
    """Create the five-phase literature review workflow."""
    builder: StateGraph = StateGraph(AcademicLitReviewState)
 
    builder.add_node("discovery", discovery_phase_node)
    builder.add_node("diffusion", diffusion_phase_node)
    builder.add_node("processing", processing_phase_node)
    builder.add_node("clustering", clustering_phase_node)
    builder.add_node("synthesis", synthesis_phase_node)
 
    builder.add_edge(START, "discovery")
    builder.add_edge("discovery", "diffusion")
    builder.add_edge("diffusion", "processing")
    builder.add_edge("processing", "clustering")
    builder.add_edge("clustering", "synthesis")
    builder.add_edge("synthesis", END)
 
    return builder.compile()
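
Invoking the compiled graph then looks like the following sketch (the initial-state keys beyond those shown in the TypedDict above are assumptions about the full state schema):

# inside an async context
graph = create_academic_lit_review_graph()
final_state = await graph.ainvoke({
    "topic": "retrieval-augmented generation",          # assumed state field
    "research_questions": ["How is RAG evaluated?"],    # assumed state field
    "keyword_papers": [],
    "citation_papers": [],
    "errors": [],
    "paper_corpus": {},
    "section_drafts": {},
})
print(len(final_state["paper_corpus"]), "papers in final corpus")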

Workflow Structure

flowchart TD
  Start([START])
  P1["Phase 1: DISCOVERY<br/>LLM-generated OpenAlex queries<br/>Forward/backward citation seeding<br/>Expert author identification"]
  P2["Phase 2: DIFFUSION<br/>Recursive citation expansion (2-5 stages)<br/>Two-stage filtering: co-citation to LLM scoring<br/>Saturation detection"]
  P3["Phase 3: PROCESSING<br/>Full-text acquisition<br/>Document processing (Marker PDF extraction)<br/>PaperSummary extraction"]
  P4["Phase 4: CLUSTERING<br/>BERTopic statistical clustering<br/>LLM semantic clustering<br/>Opus synthesis (merges both approaches)"]
  P5["Phase 5: SYNTHESIS<br/>Parallel thematic section drafting<br/>Intro/methodology/discussion/conclusions<br/>PRISMA documentation generation"]
  End([END])
  Start --> P1 --> P2 --> P3 --> P4 --> P5 --> End

Trade-offs

Benefits:

  • Comprehensive discovery via citation network expansion
  • Quality-scalable through presets (50 to 300 papers)
  • Two-stage filtering reduces LLM costs by approximately 50 percent
  • Dual clustering produces robust thematic organization
  • PRISMA documentation satisfies academic publication requirements

Costs:

  • Processing time: two to four hours for high-quality reviews
  • API costs: full-text extraction, clustering, and synthesis each consume paid API calls
  • BERTopic sensitivity: small corpora produce poor statistical clusters

When to Use This Pattern

Good fit:

  • Systematic literature reviews for academic or professional research
  • Citation network analysis adds value (papers reference each other)
  • Output quality must scale with time/budget
  • Thematic clustering needed for large paper corpora

Poor fit:

  • Simple keyword search suffices
  • Corpus is already curated
  • Output is informal summaries
  • Time constraints prohibit multi-hour processing