Dataprep Microservice with ArangoDB
🚀Start Microservice with Docker
Start ArangoDB Server
To launch ArangoDB locally, first ensure you have Docker installed. Then launch the database with the following command.
docker run -d -p 8529:8529 -e ARANGO_ROOT_PASSWORD=test arangodb/arangodb:latest
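To verify that the database is reachable, you can query ArangoDB's version endpoint (assuming the root password test set in the command above):

curl -u root:test http://localhost:8529/_api/version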
Set Environment Variables
export no_proxy=${your_no_proxy}
export http_proxy=${your_http_proxy}
export https_proxy=${your_https_proxy}
export ARANGO_URL=${your_arango_url} # e.g. http://localhost:8529
export ARANGO_USERNAME=${your_arango_username} # e.g. root
export ARANGO_PASSWORD=${your_arango_password} # e.g. test
export ARANGO_DB_NAME=${your_db_name} # e.g _system
export VLLM_ENDPOINT=${your_vllm_endpoint}
export VLLM_MODEL_ID=${your_vllm_model_id}
export VLLM_API_KEY=${your_vllm_api_key}
export TEI_EMBEDDING_ENDPOINT=${your_tei_embedding_endpoint}
export HUGGINGFACEHUB_API_TOKEN=${your_huggingface_api_token}
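For a local, single-machine setup, these might look like the following (illustrative values only; substitute your own endpoints and tokens):

export ARANGO_URL=http://localhost:8529
export ARANGO_USERNAME=root
export ARANGO_PASSWORD=test
export ARANGO_DB_NAME=_system
export VLLM_ENDPOINT=http://localhost:8000
export VLLM_MODEL_ID=Intel/neural-chat-7b-v3-3
export TEI_EMBEDDING_ENDPOINT=http://localhost:8080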
Build Docker Image
cd ~/GenAIComps/
docker build -t opea/dataprep:latest --build-arg https_proxy=$https_proxy --build-arg http_proxy=$http_proxy -f comps/dataprep/src/Dockerfile .
Run via CLI
docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host \
  -e http_proxy=$http_proxy -e https_proxy=$https_proxy \
  -e ARANGO_URL="http://localhost:8529" -e ... \
  -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" \
  opea/dataprep:latest
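Note that localhost inside the container refers to the container itself, not the host running ArangoDB. A common workaround is host.docker.internal (a sketch; on Linux you may also need the --add-host flag shown below):

docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host \
  --add-host=host.docker.internal:host-gateway \
  -e ARANGO_URL="http://host.docker.internal:8529" \
  -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" \
  opea/dataprep:latest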
Run Docker with Docker Compose
cd ~/GenAIComps/comps/dataprep/deployment/docker_compose/
docker compose up dataprep-arangodb -d
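To confirm the container started cleanly, you can tail its logs:

docker compose logs -f dataprep-arangodb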
See below for additional environment variables that can be set.
🚀Consume Dataprep Service
First, verify that the service is up and reachable:

curl http://${your_ip}:6007/v1/health_check \
  -X GET \
  -H 'Content-Type: application/json'
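If you script your deployment, a simple readiness loop against the same endpoint might look like this (a sketch; adjust the host and retry cadence to taste):

until curl -sf http://localhost:6007/v1/health_check > /dev/null; do
  echo "waiting for dataprep service..."
  sleep 2
done
echo "dataprep service is up"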
An ArangoDB graph is created from the documents provided to the microservice. The microservice extracts entities from the documents and creates nodes and relationships in the graph based on those entities. It also creates embeddings for the documents if the embedding environment variables are specified.
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./file1.txt" \
  http://localhost:6007/v1/dataprep/ingest
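The request model (shown at the end of this page) also accepts a link_list form field for ingesting web pages. In other OPEA dataprep components this field takes a JSON-encoded list of URLs, so a request might look like the following (confirm the expected encoding in api_protocol.py):

curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F 'link_list=["https://example.com/page1"]' \
  http://localhost:6007/v1/dataprep/ingest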
You can also control chunking by specifying the chunk_size and chunk_overlap parameters:
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./file1.txt" \
  -F "chunk_size=1500" \
  -F "chunk_overlap=100" \
  http://localhost:6007/v1/dataprep/ingest
We support table extraction from PDF documents. You can specify process_table and table_strategy with the following parameters:

- table_strategy: The strategy used to understand tables for table retrieval. As the setting progresses from "fast" to "hq" to "llm", the focus shifts towards deeper table understanding at the expense of processing speed. The default strategy is "fast".
- process_table: Whether to process tables in the document. The default value is False.
Note: If you specify "table_strategy=llm", you should first start the vLLM service.
curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./your_file.pdf" \
  -F "process_table=true" \
  -F "table_strategy=hq" \
  http://localhost:6007/v1/dataprep/ingest
Additional options can be specified via the following environment variables (default values are in the arangodb.py file):
ArangoDB Connection configuration

- ARANGO_URL: The URL for the ArangoDB service.
- ARANGO_USERNAME: The username for the ArangoDB service.
- ARANGO_PASSWORD: The password for the ArangoDB service.
- ARANGO_DB_NAME: The name of the database to use for the ArangoDB service.
ArangoDB Graph Insertion configuration

- ARANGO_INSERT_ASYNC: If set to True, the microservice inserts the data into ArangoDB asynchronously. Defaults to False.
- ARANGO_BATCH_SIZE: The batch size the microservice uses when inserting the data. Defaults to 500.
- ARANGO_GRAPH_NAME: The name of the graph to use/create in ArangoDB. Defaults to GRAPH.
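For example, to speed up large ingestions you might enable asynchronous inserts with a larger batch and a custom graph name (illustrative values; the exact boolean parsing is defined in arangodb.py):

docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host \
  -e ARANGO_INSERT_ASYNC=true \
  -e ARANGO_BATCH_SIZE=1000 \
  -e ARANGO_GRAPH_NAME=MY_GRAPH \
  -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" \
  opea/dataprep:latest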
vLLM Configuration

- VLLM_API_KEY: The API key for the vLLM service. Defaults to "EMPTY".
- VLLM_ENDPOINT: The endpoint for the vLLM service. Defaults to http://localhost:80.
- VLLM_MODEL_ID: The model ID for the vLLM service. Defaults to Intel/neural-chat-7b-v3-3.
- VLLM_MAX_NEW_TOKENS: The maximum number of new tokens to generate. Defaults to 512.
- VLLM_TOP_P: If set to < 1, only the smallest set of most probable tokens with probabilities that add up to top_p or higher are kept for generation. Defaults to 0.9.
- VLLM_TEMPERATURE: The temperature for sampling. Defaults to 0.8.
- VLLM_TIMEOUT: The timeout for the vLLM service. Defaults to 600.
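If you do not already have a vLLM endpoint, one way to stand one up is with the upstream vLLM container (a sketch; the GPU flags and image tag depend on your environment):

docker run -d --gpus all -p 8000:8000 vllm/vllm-openai:latest \
  --model Intel/neural-chat-7b-v3-3
export VLLM_ENDPOINT=http://localhost:8000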
Text Embeddings Inference (TEI) Configuration

- TEI_EMBEDDING_ENDPOINT: The endpoint for the TEI service.
- TEI_EMBED_MODEL: The model to use for the TEI service. Defaults to BAAI/bge-base-en-v1.5.
- HUGGINGFACEHUB_API_TOKEN: The API token for the Hugging Face Hub.
- EMBED_CHUNKS: If set to True, the microservice embeds the text chunks. Defaults to True.
- EMBED_NODES: If set to True, the microservice embeds the nodes extracted from the source documents. Defaults to True.
- EMBED_EDGES: If set to True, the microservice embeds the edges extracted from the source documents. Defaults to True.
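Similarly, a TEI endpoint serving the default embedding model can be started with the upstream container (a sketch; choose the image tag that matches your hardware):

docker run -d -p 8080:80 ghcr.io/huggingface/text-embeddings-inference:cpu-latest \
  --model-id BAAI/bge-base-en-v1.5
export TEI_EMBEDDING_ENDPOINT=http://localhost:8080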
OpenAI Configuration. Note: this configuration can replace the vLLM and TEI services for text generation and embeddings.

- OPENAI_API_KEY: The API key for the OpenAI service. If not set, the microservice will not use the OpenAI service.
- OPENAI_CHAT_MODEL: The chat model to use for the OpenAI service. Defaults to gpt-4o.
- OPENAI_CHAT_TEMPERATURE: The temperature for the OpenAI service. Defaults to 0.
- OPENAI_EMBED_MODEL: The embedding model to use for the OpenAI service. Defaults to text-embedding-3-small.
- OPENAI_EMBED_DIMENSION: The embedding dimension for the OpenAI service. Defaults to 768.
- OPENAI_CHAT_ENABLED: If set to True, the microservice uses the OpenAI service for text generation, as long as OPENAI_API_KEY is also set. Defaults to True.
- OPENAI_EMBED_ENABLED: If set to True, the microservice uses the OpenAI service for text embeddings, as long as OPENAI_API_KEY is also set. Defaults to True.
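For example, to use OpenAI for both generation and embeddings instead of local vLLM/TEI services (illustrative; the ArangoDB connection variables from above are still required):

docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  -e OPENAI_CHAT_MODEL=gpt-4o \
  -e OPENAI_EMBED_MODEL=text-embedding-3-small \
  -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" \
  opea/dataprep:latest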
LangChain LLMGraphTransformer Configuration

- SYSTEM_PROMPT_PATH: The path to a system prompt text file. This can be used to supply a custom system prompt for the entity extraction and graph generation steps; see the example after this list.
- ALLOWED_NODE_TYPES: Specifies which node types are allowed in the graph. Defaults to an empty list, allowing all node types.
- ALLOWED_EDGE_TYPES: Specifies which edge types are allowed in the graph. Defaults to an empty list, allowing all edge types.
- NODE_PROPERTIES: If True, the LLM can extract any node properties from text. Alternatively, a list of valid properties can be provided for the LLM to extract, restricting extraction to those specified. Defaults to ["description"].
- EDGE_PROPERTIES: If True, the LLM can extract any edge properties from text. Alternatively, a list of valid properties can be provided for the LLM to extract, restricting extraction to those specified. Defaults to ["description"].
- TEXT_CAPITALIZATION_STRATEGY: The capitalization strategy applied to node and edge text. Can be "lower", "upper", or "none". Defaults to "none". Useful as a basic entity-resolution technique to avoid duplicates that differ only in capitalization.
- INCLUDE_CHUNKS: If set to True, the microservice includes the chunks of text from the source documents in the graph. Defaults to True. If False, only the entities and relationships are included in the graph.
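For instance, a custom extraction prompt can be mounted into the container and referenced via SYSTEM_PROMPT_PATH (a sketch; the in-container path is arbitrary):

docker run -d --name="dataprep-arango-service" -p 6007:5000 --ipc=host \
  -v $(pwd)/my_prompt.txt:/prompts/system_prompt.txt \
  -e SYSTEM_PROMPT_PATH=/prompts/system_prompt.txt \
  -e DATAPREP_COMPONENT_NAME="OPEA_DATAPREP_ARANGODB" \
  opea/dataprep:latest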
Some of these options can also be set per request via API parameters. When provided, these override the equivalent environment variables:
class DataprepRequest(BaseModel):
    ...

class ArangoDBDataprepRequest(DataprepRequest):
    def __init__(
        self,
        files: Optional[Union[UploadFile, List[UploadFile]]] = File(None),
        link_list: Optional[str] = Form(None),
        chunk_size: Optional[int] = Form(1500),
        chunk_overlap: Optional[int] = Form(100),
        process_table: Optional[bool] = Form(False),
        table_strategy: Optional[str] = Form("fast"),
        graph_name: Optional[str] = Form(None),
        insert_async: Optional[bool] = Form(None),
        insert_batch_size: Optional[int] = Form(None),
        embed_nodes: Optional[bool] = Form(None),
        embed_edges: Optional[bool] = Form(None),
        embed_chunks: Optional[bool] = Form(None),
        allowed_node_types: Optional[List[str]] = Form(None),
        allowed_edge_types: Optional[List[str]] = Form(None),
        node_properties: Optional[List[str]] = Form(None),
        edge_properties: Optional[List[str]] = Form(None),
        text_capitalization_strategy: Optional[str] = Form(None),
        include_chunks: Optional[bool] = Form(None),
        ...
See the comps/cores/proto/api_protocol.py file for more details on the API request and response models.
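For example, the graph name and insertion batch size can be overridden on a single ingest request (field names taken from the request model above):

curl -X POST \
  -H "Content-Type: multipart/form-data" \
  -F "files=@./file1.txt" \
  -F "graph_name=MY_GRAPH" \
  -F "insert_batch_size=1000" \
  http://localhost:6007/v1/dataprep/ingest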