Dataprep Microservice¶
The Dataprep Microservice preprocesses data from various sources, structured or unstructured, into text, converts the text into embedding vectors, and stores the vectors in a database.
Install Requirements¶
apt-get update
apt-get install libreoffice
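After installing, it can be useful to confirm that LibreOffice is on the PATH. The check below is a minimal sketch, not part of the dataprep microservice. The function name check_libreoffice is hypothetical, and the binary names soffice and libreoffice are assumed from the usual Debian/Ubuntu packaging.

```shell
# Hypothetical helper: report whether a LibreOffice binary is on the PATH.
# Assumption: the package exposes "soffice" and/or "libreoffice".
check_libreoffice() {
  if command -v soffice >/dev/null 2>&1 || command -v libreoffice >/dev/null 2>&1; then
    echo "libreoffice: found"
  else
    echo "libreoffice: missing"
  fi
}

check_libreoffice
```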
Summarizing Image Data with LVM¶
Unstructured data occasionally contains images. To convert an image to text, an LVM (Large Vision Model) can be used to summarize it. To use this feature, refer to this readme to start the LVM microservice first, then set the environment variable below before starting any dataprep microservice.
export SUMMARIZE_IMAGE_VIA_LVM=1
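Because the variable must be set before the dataprep microservice starts, a launch script might check it first. The snippet below is a sketch only. The function lvm_summary_status is hypothetical, and it assumes that only the value 1 enables the feature and that an unset variable means disabled.

```shell
# Enable image summarization, as described above.
export SUMMARIZE_IMAGE_VIA_LVM=1

# Hypothetical pre-launch check (not part of the dataprep microservice).
# Assumption: only the value "1" counts as enabled; unset means disabled.
lvm_summary_status() {
  if [ "${SUMMARIZE_IMAGE_VIA_LVM:-0}" = "1" ]; then
    echo "enabled"
  else
    echo "disabled"
  fi
}

echo "Image summarization via LVM: $(lvm_summary_status)"
```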
Dataprep Microservice on Various Databases¶
The dataprep microservice is supported on various databases, as shown in the table below. For details, refer to the respective readme.
Databases | Readme |
---|---|
Running in an air-gapped environment¶
The following steps apply to all DB backends when running the dataprep microservice in an air-gapped environment (that is, an environment with no internet access).
1. Download the following models, e.g.
   huggingface-cli download --cache-dir <model data directory> <model>
   - microsoft/table-transformer-structure-recognition
   - timm/resnet18.a1_in1k
   - unstructuredio/yolo_x_layout
2. Launch the dataprep microservice with the following settings:
   - Mount the model data directory as the /data directory within the dataprep container.
   - Set the environment variable HF_HUB_OFFLINE to 1 when launching the dataprep microservice.

   e.g. docker run -d -v <model data directory>:/data -e HF_HUB_OFFLINE=1 ... ...
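The model download step above can be scripted as a loop over the three model names. This is a sketch under stated assumptions: MODEL_DIR, DRY_RUN, and download_models are hypothetical names introduced here, and ./model_data is a placeholder path. By default the script only prints the commands. Set DRY_RUN=0 on a machine with internet access and huggingface-cli installed to run the downloads.

```shell
set -eu

# Hypothetical settings; replace ./model_data with your model data directory.
MODEL_DIR="${MODEL_DIR:-./model_data}"
# DRY_RUN=1 (default) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

# The three models listed above.
MODELS="microsoft/table-transformer-structure-recognition timm/resnet18.a1_in1k unstructuredio/yolo_x_layout"

download_models() {
  for model in $MODELS; do
    if [ "$DRY_RUN" = "1" ]; then
      echo "huggingface-cli download --cache-dir $MODEL_DIR $model"
    else
      huggingface-cli download --cache-dir "$MODEL_DIR" "$model"
    fi
  done
}

download_models
```

The same directory is then the one to mount as /data when launching the container with HF_HUB_OFFLINE=1.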