GraphRAG Example Datasets¶
This directory contains example datasets carefully crafted to demonstrate GraphRAG’s capabilities for knowledge graph construction and querying.
Programming Languages Dataset (programming_languages.txt
)¶
A concise dataset that showcases GraphRAG’s ability to:
Entity Extraction
People (Guido van Rossum, James Gosling)
Organizations (CWI, Sun Microsystems, Google)
Programming Languages (Python, Java, ABC, Go)
Technologies (REPL, var keyword)
Relationship Types
Creation relationships (“created by”)
Influence relationships (“influenced by”)
Employment relationships (“worked at”)
Usage relationships (“used by”)
Feature relationships (“borrowed ideas from”)
Temporal Information
Creation dates (1991, 1995, 2009)
Sequential influences (ABC → Python → Java)
Complex Reasoning Capabilities
Bidirectional influences (Java ↔ Python)
Multi-hop relationships (ABC → Python → Java’s features)
Organizational relationships (Google’s use of multiple languages)
Example Queries¶
This dataset is ideal for testing queries like:
“What are the main themes of the programming dataset?”
“What’s the relationship between Google and these programming languages?”
“How did early teaching languages influence modern programming languages?”
“Trace the evolution of programming language features through these languages.”
“What role did corporate entities play in language development?”
Community Detection¶
The dataset is structured to form natural communities around:
Language Development (Python, ABC, Guido)
Corporate Influence (Google, Java, Go)
Language Features (OOP, REPL, var keyword)
This makes it perfect for testing GraphRAG’s community detection and summarization capabilities.
Why Traditional RAG Falls Short¶
For the example queries above, traditional RAG approaches would struggle in several ways:
Multi-hop Relationships
Traditional RAG: Can only find direct relationships within single documents
Example: For “How did ABC influence Java’s features?”, traditional RAG might miss the connection because it can’t trace ABC → Python → Java
GraphRAG: Can traverse multiple relationship hops to uncover indirect influences
Community Analysis
Traditional RAG: Limited to keyword matching and proximity-based relationships
Example: “What programming language communities formed around Google?” requires understanding organizational and temporal relationships
GraphRAG: Can detect and analyze communities through relationship patterns and clustering
Bidirectional Relationships
Traditional RAG: Typically treats relationships as unidirectional text mentions
Example: Understanding how Java and Python mutually influenced each other requires tracking bidirectional relationships
GraphRAG: Explicitly models bidirectional relationships and their evolution over time
Complex Entity Relationships
Traditional RAG: Struggles to maintain consistency across multiple entity mentions
Example: “Trace the evolution of REPL features” requires understanding how the feature moved across languages
GraphRAG: Maintains consistent entity relationships across the entire knowledge graph
Temporal Evolution
Traditional RAG: Limited ability to track changes over time
Example: Understanding how language features evolved requires tracking temporal relationships
GraphRAG: Can model and query temporal relationships between entities
These limitations make traditional RAG less effective for complex queries that require understanding relationships, community structures, and temporal evolution. GraphRAG’s knowledge graph approach provides a more complete and accurate representation of these complex relationships.