Dataprep Microservice with ArangoDB

🚀Start Microservice with Docker

Start ArangoDB Server

To launch ArangoDB locally, first ensure you have Docker installed. Then launch the database with the following command.

docker run -d -p 8529:8529 -e ARANGO_ROOT_PASSWORD=test arangodb/arangodb:latest
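
You can verify the server is running by querying its version endpoint (using the root credentials set above):

curl -u root:test http://localhost:8529/_api/version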

Set Environment Variables

export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export ARANGO_URL=${your_arango_url} # e.g. http://localhost:8529
export ARANGO_USERNAME=${your_arango_username} # e.g. root
export ARANGO_PASSWORD=${your_arango_password} # e.g. test
export ARANGO_DB_NAME=${your_db_name} # e.g _system
export VLLM_ENDPOINT=${your_vllm_endpoint}
export VLLM_MODEL_ID=${your_vllm_model_id}
export VLLM_API_KEY=${your_vllm_api_key}
export TEI_EMBEDDING_ENDPOINT=${your_tei_embedding_endpoint}
export HUGGINGFACEHUB_API_TOKEN=${your_huggingface_api_token}

Build Docker Image

cd ~/GenAIComps/
docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .

Run via CLI

docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host -e http_proxy=$http_proxy -e https_proxy=$https_proxy -e ARANGO_URL=$ARANGO_URL -e ... -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" opea/dataprep:latest
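
If the dataprep service cannot reach ArangoDB, note that localhost inside the container refers to the container itself, so ARANGO_URL should point to a host-reachable address. To inspect the service logs:

docker logs dataprep-arango-service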

Run via Docker Compose

cd ~/GenAIComps/comps/dataprep/deployment/docker_compose/
docker compose up dataprep-arangodb -d

See below for additional environment variables that can be set.

🚀Consume Dataprep Service

Check the service health status:
curl http://${your_ip}:6007/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'

An ArangoDB graph is created from the documents provided to the microservice. The microservice extracts entities from the documents and creates nodes and relationships in the graph based on those entities. To ingest a document:

curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./file1.txt" \
    http://localhost:6007/v1/dataprep/ingest
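
You can also ingest content from URLs instead of files via the link_list form field. A sketch (the URL is a placeholder; the field takes a JSON-encoded list of URLs, matching the request model shown at the end of this document):

curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F 'link_list=["https://www.example.com"]' \
    http://localhost:6007/v1/dataprep/ingest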

By default, the microservice creates embeddings for the documents if the embedding environment variables are specified.

You can also specify the chunk_size and chunk_overlap with the following parameters (defaults: 1500 and 100):

curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./file1.txt" \
    -F "chunk_size=1500" \
    -F "chunk_overlap=100" \
    http://localhost:6007/v1/dataprep/ingest

We support table extraction from PDF documents. You can specify process_table and table_strategy with the following parameters:

  • process_table: whether to process tables in the document. Defaults to False.

  • table_strategy: the strategy used to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm", the focus shifts towards deeper table understanding at the expense of processing speed. Defaults to "fast".

Note: If you specify "table_strategy=llm", you should first start the vLLM Service.
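
If you do not already have one running, a vLLM server can be launched with the upstream vLLM container. This is a minimal sketch, assuming a GPU host and the public vllm/vllm-openai image; OPEA deployments may use a different image or flags:

docker run -d --gpus all -p 8000:8000 \
    -e HUGGING_FACE_HUB_TOKEN=$HUGGINGFACEHUB_API_TOKEN \
    vllm/vllm-openai:latest \
    --model $VLLM_MODEL_ID

export VLLM_ENDPOINT=http://localhost:8000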

curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./your_file.pdf" \
    -F "process_table=true" \
    -F "table_strategy=hq" \
    http://localhost:6007/v1/dataprep/ingest
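
Once documents have been ingested, you can confirm that the graph was created by listing the graphs in ArangoDB over its HTTP API (using the credentials from the setup above):

curl -u root:test http://localhost:8529/_api/gharial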

Additional options can be specified via environment variables as follows (default values are defined in the arangodb.py file):

ArangoDB Connection Configuration

  • ARANGO_URL: The URL for the ArangoDB service.

  • ARANGO_USERNAME: The username for the ArangoDB service.

  • ARANGO_PASSWORD: The password for the ArangoDB service.

  • ARANGO_DB_NAME: The name of the database to use for the ArangoDB service.

ArangoDB Graph Insertion Configuration

  • ARANGO_INSERT_ASYNC: If set to True, the microservice inserts the data into ArangoDB asynchronously. Defaults to False.

  • ARANGO_BATCH_SIZE: The batch size used when inserting the data. Defaults to 500.

  • ARANGO_GRAPH_NAME: The name of the graph to use/create in ArangoDB. Defaults to GRAPH.
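
For example, to make the insertion defaults explicit in your environment:

export ARANGO_INSERT_ASYNC=false
export ARANGO_BATCH_SIZE=500
export ARANGO_GRAPH_NAME=GRAPH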

vLLM Configuration

  • VLLM_API_KEY: The API key for the vLLM service. Defaults to "EMPTY".

  • VLLM_ENDPOINT: The endpoint for the vLLM service. Defaults to http://localhost:80.

  • VLLM_MODEL_ID: The model ID for the vLLM service. Defaults to Intel/neural-chat-7b-v3-3.

  • VLLM_MAX_NEW_TOKENS: The maximum number of new tokens to generate. Defaults to 512.

  • VLLM_TOP_P: If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Defaults to 0.9.

  • VLLM_TEMPERATURE: The sampling temperature. Defaults to 0.8.

  • VLLM_TIMEOUT: The timeout for the vLLM service. Defaults to 600.
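
For example, to cap generation length and use more conservative sampling (the values are illustrative, not recommendations):

export VLLM_MAX_NEW_TOKENS=256
export VLLM_TOP_P=0.9
export VLLM_TEMPERATURE=0.2
export VLLM_TIMEOUT=600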

Text Embeddings Inferencing Configuration

  • TEI_EMBEDDING_ENDPOINT: The endpoint for the TEI service.

  • TEI_EMBED_MODEL: The model to use for the TEI service. Defaults to BAAI/bge-base-en-v1.5.

  • HUGGINGFACEHUB_API_TOKEN: The API token for the Hugging Face Hub.

  • EMBED_CHUNKS: If set to True, the microservice will embed the chunks. Defaults to True.

  • EMBED_NODES: If set to True, the microservice will embed the nodes extracted from the source documents. Defaults to True.

  • EMBED_EDGES: If set to True, the microservice will embed the edges extracted from the source documents. Defaults to True.
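
For example, to use a TEI service for embeddings while skipping edge embeddings (the endpoint is a placeholder for wherever your TEI service listens):

export TEI_EMBEDDING_ENDPOINT=${your_tei_embedding_endpoint}
export TEI_EMBED_MODEL=BAAI/bge-base-en-v1.5
export EMBED_CHUNKS=true
export EMBED_NODES=true
export EMBED_EDGES=false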

OpenAI Configuration

Note: This configuration can replace the vLLM and TEI services for text generation and embeddings.

  • OPENAI_API_KEY: The API key for the OpenAI service. If not set, the microservice will not use the OpenAI service.

  • OPENAI_CHAT_MODEL: The chat model to use for the OpenAI service. Defaults to gpt-4o.

  • OPENAI_CHAT_TEMPERATURE: The temperature for the OpenAI service. Defaults to 0.

  • OPENAI_EMBED_MODEL: The embedding model to use for the OpenAI service. Defaults to text-embedding-3-small.

  • OPENAI_EMBED_DIMENSION: The embedding dimension for the OpenAI service. Defaults to 768.

  • OPENAI_CHAT_ENABLED: If set to True, the microservice will use the OpenAI service for text generation, as long as OPENAI_API_KEY is also set. Defaults to True.

  • OPENAI_EMBED_ENABLED: If set to True, the microservice will use the OpenAI service for text embeddings, as long as OPENAI_API_KEY is also set. Defaults to True.
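
For example, to use OpenAI in place of vLLM and TEI for both text generation and embeddings:

export OPENAI_API_KEY=${your_openai_api_key}
export OPENAI_CHAT_MODEL=gpt-4o
export OPENAI_EMBED_MODEL=text-embedding-3-small
export OPENAI_CHAT_ENABLED=true
export OPENAI_EMBED_ENABLED=true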

LangChain LLMGraphTransformer Configuration

  • SYSTEM_PROMPT_PATH: The path to the system prompt text file. This can be used to provide a custom system prompt for the entity extraction and graph generation steps.

  • ALLOWED_NODE_TYPES: Specifies which node types are allowed in the graph. Defaults to an empty list, allowing all node types.

  • ALLOWED_EDGE_TYPES: Specifies which edge types are allowed in the graph. Defaults to an empty list, allowing all edge types.

  • NODE_PROPERTIES: If True, the LLM can extract any node properties from text. Alternatively, a list of valid properties can be provided for the LLM to extract, restricting extraction to those specified. Defaults to ["description"].

  • EDGE_PROPERTIES: If True, the LLM can extract any edge properties from text. Alternatively, a list of valid properties can be provided for the LLM to extract, restricting extraction to those specified. Defaults to ["description"].

  • TEXT_CAPITALIZATION_STRATEGY: The capitalization strategy applied to node and edge text. Can be "lower", "upper", or "none". Defaults to "none". Useful as a basic entity resolution technique to avoid duplicates that differ only in capitalization.

  • INCLUDE_CHUNKS: If set to True, the microservice will include the chunks of text from the source documents in the graph. Defaults to True. If False, only the entities and relationships will be included in the graph.
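
For example, to supply a custom extraction prompt and lowercase all entity text (the prompt path is a placeholder):

export SYSTEM_PROMPT_PATH=${your_system_prompt_path}
export TEXT_CAPITALIZATION_STRATEGY=lower
export INCLUDE_CHUNKS=true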

Some of these options are also available as parameters in the API call. If set, they override the equivalent environment variables:

class DataprepRequest(BaseModel):
    ...


class ArangoDBDataprepRequest(DataprepRequest):
    def __init__(
        self,
        files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
        link_list: Optional[str] = Form(None),
        chunk_size: Optional[int] = Form(1500),
        chunk_overlap: Optional[int] = Form(100),
        process_table: Optional[bool] = Form(False),
        table_strategy: Optional[str] = Form("fast"),
        graph_name: Optional[str] = Form(None),
        insert_async: Optional[bool] = Form(None),
        insert_batch_size: Optional[int] = Form(None),
        embed_nodes: Optional[bool] = Form(None),
        embed_edges: Optional[bool] = Form(None),
        embed_chunks: Optional[bool] = Form(None),
        allowed_node_types: Optional[List[str]] = Form(None),
        allowed_edge_types: Optional[List[str]] = Form(None),
        node_properties: Optional[List[str]] = Form(None),
        edge_properties: Optional[List[str]] = Form(None),
        text_capitalization_strategy: Optional[str] = Form(None),
        include_chunks: Optional[bool] = Form(None),
        ...

See the comps/cores/proto/api_protocol.py file for more details on the API request and response models.
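
For example, a single ingestion request that overrides the graph name, insertion batch size, and chunk embedding behavior (field values are illustrative):

curl -X POST \
    -H "Content-Type: multipart/form-data" \
    -F "files=@./file1.txt" \
    -F "graph_name=MY_GRAPH" \
    -F "insert_batch_size=250" \
    -F "embed_chunks=false" \
    http://localhost:6007/v1/dataprep/ingest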