Knowledge Graph Generator: Extract Entities and Visualize Relationships with AI

Transform unstructured text into structured knowledge! In this post, I'll show you how to build a knowledge graph generator that extracts entities, detects relationships, and creates interactive visualizations.

🎯 What It Does

Input text:

"Apple Inc. was founded by Steve Jobs in Cupertino, California. 
The company revolutionized personal computing with the Macintosh."

Output graph:

Entities:
- Apple Inc. (ORGANIZATION)
- Steve Jobs (PERSON)
- Cupertino (LOCATION)
- California (LOCATION)
- Macintosh (PRODUCT)

Relationships:
- Steve Jobs β†’ FOUNDED β†’ Apple Inc.
- Apple Inc. β†’ LOCATED_IN β†’ Cupertino
- Cupertino β†’ LOCATED_IN β†’ California
- Apple Inc. β†’ CREATED β†’ Macintosh

πŸ—οΈ Architecture

Text Input
    β”‚
    β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚    spaCy     β”‚  Named Entity Recognition
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   GPT-4      β”‚  Relationship Extraction
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  NetworkX    β”‚  Graph Construction
β””β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”˜
       β”‚
       β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Visualizationβ”‚  Pyvis / D3.js
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’» Implementation

1. Entity Extraction with spaCy

spaCy excels at Named Entity Recognition:

import spacy
import networkx as nx
from typing import Dict, List

from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

class KnowledgeGraphGenerator:
    """Generate knowledge graphs from unstructured text."""
    
    def __init__(self):
        # Load spaCy model
        self.nlp = spacy.load("en_core_web_lg")
        
        # Initialize LLM for relationships
        self.llm = ChatOpenAI(
            model=settings.LLM_MODEL,
            temperature=0.0
        )
        
        self.graph = nx.DiGraph()
    
    def _extract_entities(self, text: str) -> List[Dict]:
        """Extract named entities using spaCy."""
        doc = self.nlp(text)
        
        entities = []
        seen = set()
        
        for ent in doc.ents:
            if ent.text not in seen:
                entities.append({
                    "id": str(len(entities) + 1),
                    "name": ent.text,
                    "type": ent.label_,
                    "start": ent.start_char,
                    "end": ent.end_char
                })
                seen.add(ent.text)
        
        return entities

Entity Types spaCy Recognizes:

  • PERSON - People, including fictional
  • ORG - Companies, institutions
  • GPE - Countries, cities, states
  • LOC - Non-GPE locations
  • DATE - Absolute or relative dates
  • MONEY - Monetary values
  • PRODUCT - Objects, vehicles, foods, etc.
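Note that spaCy's raw labels (ORG, GPE) differ from the friendlier names in the example output at the top (ORGANIZATION, LOCATION). One way to bridge the two is a small normalization map; the mapping below is illustrative, not part of spaCy itself:

```python
# Illustrative mapping from spaCy labels to display names;
# anything unmapped falls through unchanged.
SPACY_LABEL_MAP = {
    "ORG": "ORGANIZATION",
    "GPE": "LOCATION",
    "LOC": "LOCATION",
}

def normalize_label(label: str) -> str:
    """Return a display name for a spaCy entity label."""
    return SPACY_LABEL_MAP.get(label, label)

print(normalize_label("GPE"))  # LOCATION
print(normalize_label("PERSON"))  # PERSON
```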

2. Relationship Extraction with GPT

Use LLM to identify connections:

def _extract_relationships(
    self,
    text: str,
    entities: List[Dict]
) -> List[Dict]:
    """Extract relationships using LLM."""
    if len(entities) < 2:
        return []
    
    entity_names = [e["name"] for e in entities]
    
    prompt = ChatPromptTemplate.from_messages([
        ("system", """Extract relationships between entities from text.
        Return as: Entity1 | RELATIONSHIP_TYPE | Entity2
        One per line. Only use entities from the provided list.
        
        Common relationship types:
        - FOUNDED, CREATED, INVENTED
        - WORKS_FOR, EMPLOYED_BY
        - LOCATED_IN, BASED_IN
        - ACQUIRED, MERGED_WITH
        - RELATED_TO, ASSOCIATED_WITH"""),
        ("user", """Text: {text}
        
        Entities: {entities}
        
        Relationships:""")
    ])
    
    chain = prompt | self.llm
    response = chain.invoke({
        "text": text[:2000],
        "entities": ", ".join(entity_names)
    })
    
    relationships = []
    for line in response.content.strip().split('\n'):
        parts = [p.strip() for p in line.split('|')]
        if len(parts) == 3:
            from_ent, rel_type, to_ent = parts
            
            # Find entity IDs
            from_id = next((e["id"] for e in entities 
                           if e["name"] == from_ent), None)
            to_id = next((e["id"] for e in entities 
                         if e["name"] == to_ent), None)
            
            if from_id and to_id:
                relationships.append({
                    "from": from_id,
                    "to": to_id,
                    "type": rel_type,
                    "strength": 0.8
                })
    
    return relationships
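To see the parsing step in isolation, here is a standalone sketch fed with a hand-written response in the `Entity1 | TYPE | Entity2` format (a real model's output will vary). Lines referring to unknown entities are silently dropped, which is also what guards against hallucinated endpoints:

```python
from typing import Dict, List

def parse_relationships(response_text: str, entities: List[Dict]) -> List[Dict]:
    """Parse 'A | REL | B' lines, keeping only known entities."""
    name_to_id = {e["name"]: e["id"] for e in entities}
    relationships = []
    for line in response_text.strip().split("\n"):
        parts = [p.strip() for p in line.split("|")]
        if len(parts) == 3 and parts[0] in name_to_id and parts[2] in name_to_id:
            relationships.append({
                "from": name_to_id[parts[0]],
                "to": name_to_id[parts[2]],
                "type": parts[1],
            })
    return relationships

entities = [{"id": "1", "name": "Steve Jobs"}, {"id": "2", "name": "Apple Inc."}]
sample = "Steve Jobs | FOUNDED | Apple Inc.\nSteve Jobs | FOUNDED | NeXT"
print(parse_relationships(sample, entities))
# [{'from': '1', 'to': '2', 'type': 'FOUNDED'}]  -- the NeXT line is dropped
```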

3. Graph Construction

Build NetworkX graph:

def _build_graph(
    self,
    entities: List[Dict],
    relationships: List[Dict]
):
    """Build NetworkX graph."""
    self.graph.clear()
    
    # Add nodes
    for entity in entities:
        self.graph.add_node(
            entity["id"],
            name=entity["name"],
            type=entity["type"]
        )
    
    # Add edges
    for rel in relationships:
        self.graph.add_edge(
            rel["from"],
            rel["to"],
            type=rel["type"],
            weight=rel["strength"]
        )

def generate(self, text: str) -> Dict:
    """Generate knowledge graph from text."""
    # Extract entities
    entities = self._extract_entities(text)
    
    # Extract relationships
    relationships = self._extract_relationships(text, entities)
    
    # Build graph
    self._build_graph(entities, relationships)
    
    return {
        "entities": entities,
        "relationships": relationships,
        "stats": {
            "num_entities": len(entities),
            "num_relationships": len(relationships),
            "num_nodes": self.graph.number_of_nodes(),
            "num_edges": self.graph.number_of_edges()
        }
    }

4. Interactive Visualization

Create beautiful visualizations with Pyvis:

from pyvis.network import Network

def visualize(self, graph_data: Dict, output: str = "graph.html"):
    """Create interactive visualization."""
    net = Network(
        height="750px",
        width="100%",
        directed=True,
        notebook=False
    )
    
    # Customize appearance
    net.set_options("""
    {
        "physics": {
            "forceAtlas2Based": {
                "gravitationalConstant": -50,
                "centralGravity": 0.01,
                "springLength": 200
            },
            "solver": "forceAtlas2Based"
        }
    }
    """)
    
    # Add nodes with colors by type
    for entity in graph_data["entities"]:
        net.add_node(
            entity["id"],
            label=entity["name"],
            title=f"{entity['type']}: {entity['name']}",
            color=self._get_color(entity["type"]),
            size=25
        )
    
    # Add edges with labels
    for rel in graph_data["relationships"]:
        net.add_edge(
            rel["from"],
            rel["to"],
            title=rel["type"],
            label=rel["type"],
            arrows="to"
        )
    
    net.save_graph(output)
    return output

def _get_color(self, entity_type: str) -> str:
    """Get color for entity type."""
    colors = {
        "PERSON": "#FF6B6B",      # Red
        "ORG": "#4ECDC4",          # Teal
        "GPE": "#45B7D1",          # Blue
        "LOCATION": "#96CEB4",     # Green
        "DATE": "#FFEAA7",         # Yellow
        "PRODUCT": "#DFE6E9",      # Gray
        "MONEY": "#74B9FF"         # Light Blue
    }
    return colors.get(entity_type, "#95A5A6")

🌐 FastAPI Integration

REST API for graph generation:

from fastapi import FastAPI
from pydantic import BaseModel

app = FastAPI()
kg_gen = KnowledgeGraphGenerator()

class TextInput(BaseModel):
    text: str

@app.post("/api/generate")
async def generate_graph(input: TextInput):
    """Generate knowledge graph from text."""
    result = kg_gen.generate(input.text)
    
    # Also create visualization
    html_file = kg_gen.visualize(result)
    
    return {
        **result,
        "visualization": html_file
    }

πŸ“Š Advanced Features

Graph Analytics

Analyze the knowledge graph:

def analyze_graph(self) -> Dict:
    """Compute graph analytics."""
    num_nodes = self.graph.number_of_nodes()
    return {
        "density": nx.density(self.graph),
        "num_components": nx.number_weakly_connected_components(self.graph),
        "avg_degree": sum(dict(self.graph.degree()).values()) / max(num_nodes, 1),
        "central_entities": self._get_central_entities()
    }

def _get_central_entities(self, top_k: int = 5) -> List[Dict]:
    """Find most central entities."""
    centrality = nx.degree_centrality(self.graph)
    
    sorted_entities = sorted(
        centrality.items(),
        key=lambda x: x[1],
        reverse=True
    )[:top_k]
    
    return [
        {
            "id": entity_id,
            "name": self.graph.nodes[entity_id]["name"],
            "centrality": score
        }
        for entity_id, score in sorted_entities
    ]
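On a toy three-node graph the metrics behave as expected (NetworkX defines degree centrality as a node's degree divided by n - 1):

```python
import networkx as nx

g = nx.DiGraph()
g.add_edge("Steve Jobs", "Apple Inc.", type="FOUNDED")
g.add_edge("Apple Inc.", "Macintosh", type="CREATED")

# Density of a directed graph: edges / (n * (n - 1)) = 2 / 6
print(round(nx.density(g), 3))  # 0.333

centrality = nx.degree_centrality(g)
# "Apple Inc." touches both edges, so it is the most central node
print(max(centrality, key=centrality.get))  # Apple Inc.
```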

Neo4j Integration

Store graphs in Neo4j for advanced queries:

from neo4j import GraphDatabase

class Neo4jConnector:
    def __init__(self):
        self.driver = GraphDatabase.driver(
            settings.NEO4J_URI,
            auth=(settings.NEO4J_USER, settings.NEO4J_PASSWORD)
        )
    
    def save_graph(self, entities: List[Dict], relationships: List[Dict]):
        """Save knowledge graph to Neo4j."""
        with self.driver.session() as session:
            # Create entities
            for entity in entities:
                session.run(
                    """
                    CREATE (e:Entity {
                        id: $id,
                        name: $name,
                        type: $type
                    })
                    """,
                    id=entity["id"],
                    name=entity["name"],
                    type=entity["type"]
                )
            
            # Create relationships
            for rel in relationships:
                session.run(
                    """
                    MATCH (a:Entity {id: $from})
                    MATCH (b:Entity {id: $to})
                    CREATE (a)-[r:RELATIONSHIP {
                        type: $type,
                        strength: $strength
                    }]->(b)
                    """,
                    **rel
                )
    
    def query_path(self, entity1: str, entity2: str):
        """Find shortest path between entities."""
        with self.driver.session() as session:
            result = session.run(
                """
                MATCH path = shortestPath(
                    (a:Entity {name: $entity1})-[*]-(b:Entity {name: $entity2})
                )
                RETURN path
                """,
                entity1=entity1,
                entity2=entity2
            )
            return result.single()

🎯 Real-World Example

Processing a research paper abstract:

text = """
The transformer architecture, introduced by Vaswani et al. in 2017, 
revolutionized natural language processing. Google developed BERT 
based on transformers, which OpenAI later built upon with GPT-3. 
These models use attention mechanisms invented at Google Brain.
"""

kg = KnowledgeGraphGenerator()
result = kg.generate(text)

# Abridged output (entity names shown in place of IDs for readability):
{
  "entities": [
    {"name": "Vaswani", "type": "PERSON"},
    {"name": "2017", "type": "DATE"},
    {"name": "Google", "type": "ORG"},
    {"name": "BERT", "type": "PRODUCT"},
    {"name": "OpenAI", "type": "ORG"},
    {"name": "GPT-3", "type": "PRODUCT"},
    {"name": "Google Brain", "type": "ORG"}
  ],
  "relationships": [
    {"from": "Vaswani", "to": "transformer", "type": "INTRODUCED"},
    {"from": "Google", "to": "BERT", "type": "DEVELOPED"},
    {"from": "OpenAI", "to": "GPT-3", "type": "CREATED"},
    {"from": "GPT-3", "to": "BERT", "type": "BUILT_UPON"}
  ]
}

🚧 Challenges & Solutions

Challenge 1: Entity Disambiguation

Problem: "Apple" could mean fruit or company Solution:

  • Use context from surrounding text
  • Implement entity linking to knowledge bases
  • Allow manual disambiguation
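A crude version of the context-based approach can be sketched as keyword voting; the cue lists below are illustrative, and a production system would use entity linking against a knowledge base instead:

```python
# Hand-picked context cues per sense -- illustrative only.
SENSE_CUES = {
    "company": {"inc", "founded", "ceo", "shares", "iphone"},
    "fruit": {"eat", "tree", "pie", "orchard", "juice"},
}

def disambiguate(context: str) -> str:
    """Pick the sense whose cue words overlap the context the most."""
    words = set(context.lower().split())
    scores = {sense: len(words & cues) for sense, cues in SENSE_CUES.items()}
    return max(scores, key=scores.get)

print(disambiguate("Apple shares rose after the iPhone launch"))  # company
print(disambiguate("I picked an apple from the tree to make pie"))  # fruit
```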

Challenge 2: Relationship Accuracy

Problem: The LLM sometimes hallucinates relationships.

Solution:

  • Restrict to entities from text only
  • Use lower temperature (0.0)
  • Validate against text spans
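The span-validation idea can be sketched as a cheap post-filter: keep a relationship only if both entity surface forms actually occur in the source text. (A stricter check would also require that they co-occur within the same sentence or window.)

```python
from typing import Dict, List

def validate_relationships(text: str, entities: List[Dict],
                           relationships: List[Dict]) -> List[Dict]:
    """Drop relationships whose endpoints never appear in the text."""
    id_to_name = {e["id"]: e["name"] for e in entities}
    lowered = text.lower()
    return [
        r for r in relationships
        if r["from"] in id_to_name and r["to"] in id_to_name
        and id_to_name[r["from"]].lower() in lowered
        and id_to_name[r["to"]].lower() in lowered
    ]

text = "Steve Jobs founded Apple Inc. in Cupertino."
entities = [{"id": "1", "name": "Steve Jobs"},
            {"id": "2", "name": "Apple Inc."},
            {"id": "3", "name": "Microsoft"}]
rels = [{"from": "1", "to": "2", "type": "FOUNDED"},
        {"from": "1", "to": "3", "type": "FOUNDED"}]  # hallucinated
print(len(validate_relationships(text, entities, rels)))  # 1
```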

Challenge 3: Scalability

Problem: Large documents slow processing down.

Solution:

  • Process in chunks
  • Incremental graph building
  • Caching entity extractions
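Chunking can be sketched as a sliding window with overlap, so entities that straddle a chunk boundary still land whole in at least one chunk (the sizes below are illustrative):

```python
from typing import List

def chunk_text(text: str, size: int = 1000, overlap: int = 100) -> List[str]:
    """Split text into overlapping windows of up to `size` characters."""
    if len(text) <= size:
        return [text]
    chunks = []
    step = size - overlap
    for start in range(0, len(text), step):
        chunks.append(text[start:start + size])
        if start + size >= len(text):
            break
    return chunks

chunks = chunk_text("x" * 2500, size=1000, overlap=100)
print(len(chunks))  # 3
```

Each chunk would then be run through entity and relationship extraction separately, with the results merged into one graph (deduplicating entities by name).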

πŸ’‘ Use Cases

  1. Research: Map citations and concepts in papers
  2. Business: Analyze company relationships
  3. Legal: Track case connections
  4. Journalism: Investigate networks
  5. Education: Visualize concept relationships

πŸ“¦ Tech Stack

  • NLP: spaCy (en_core_web_lg)
  • LLM: GPT-4 for relationships
  • Graph Library: NetworkX
  • Visualization: Pyvis, D3.js
  • Graph DB: Neo4j (optional)
  • Backend: FastAPI

Next in Series: AI Research Agent - Build autonomous agents that conduct research and generate reports.