Knowledge base from a PDF file

In this section, we will discuss how to create a vector collection snapshot from a PDF file. First, we will parse the unstructured PDF file into a structured markdown file. Then, we will follow the steps from Knowledge base from a markdown file to create embeddings for your PDF files.

Tools to convert a PDF file to a markdown file

Tool #1: LlamaParse

LlamaParse is a tool to parse files for optimal RAG. You will need a LlamaCloud key from https://cloud.llamaindex.ai.

First, install the dependencies. We assume that you already have Node.js 20+ installed.

git clone https://github.com/alabulei1/llamaparse-integration.git
cd llamaparse-integration
npm install llamaindex
npm install dotenv

Then, edit the .env file to set the PDF file path and your LlamaCloud key. In this case, you can ignore the LLM-related settings.

After that, run the following command to parse your PDF into a markdown file.

npx tsx transMd.ts

By default, the output markdown file is named output.md and is located in this folder. You can change the path in the .env file.
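Before moving on, it can help to confirm that the conversion produced usable markdown. The following is a minimal sketch and not part of the llamaparse-integration repository: it assumes the default output.md path described above, and the helper name summarize_markdown is hypothetical. It counts headings and non-empty lines so that an empty or structure-less result is easy to spot.

```python
import re
from pathlib import Path

# Hypothetical helper, not part of LlamaParse: summarize a markdown string.
HEADING_RE = re.compile(r"^#{1,6}\s+\S")  # ATX headings such as "# Title"
FENCE = "`" * 3                           # marker that opens/closes a code fence


def summarize_markdown(text: str) -> dict:
    """Count headings and non-empty lines, ignoring '#' lines inside code fences."""
    headings = 0
    non_empty = 0
    in_fence = False
    for line in text.splitlines():
        stripped = line.strip()
        if stripped.startswith(FENCE):
            in_fence = not in_fence  # fence lines themselves are not counted
            continue
        if not stripped:
            continue
        non_empty += 1
        if not in_fence and HEADING_RE.match(stripped):
            headings += 1
    return {"headings": headings, "non_empty_lines": non_empty}


if __name__ == "__main__":
    # Assumes the default output path; adjust it if you changed the path in .env.
    path = Path("output.md")
    if path.exists():
        print(summarize_markdown(path.read_text(encoding="utf-8")))
    else:
        print(f"{path} not found; run the conversion first.")
```

A result with zero headings or very few non-empty lines suggests the PDF did not convert cleanly and is worth inspecting by hand before creating embeddings.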

Tool #2: GPTPDF

GPTPDF is an open-source tool that uses GPT-4o to parse PDF files into markdown. You will need an OpenAI API key.

First, install the gptpdf software.

pip install gptpdf

Then, start the Python interpreter.

python

Next, use the following code to parse your PDF.

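from gptpdf import parse_pdf

api_key = 'Your OpenAI API Key'
pdf_path = 'Your PDF path'  # path to the PDF file you want to convert
# parse_pdf returns the markdown text and the paths of the extracted images
content, image_paths = parse_pdf(pdf_path, api_key=api_key)
print(content)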

The output markdown file, named output.md, will be located in your root directory.
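The interactive session above can also be saved as a small script. The sketch below is one possible arrangement, not the GPTPDF project's own example: it reads the key from an environment variable named OPENAI_API_KEY (an assumed convention) instead of hard-coding it, and it uses only the parse_pdf call shown above. The helper names get_api_key and convert are hypothetical.

```python
import os


def get_api_key(env=os.environ) -> str:
    """Return the OpenAI key from the environment.

    The variable name OPENAI_API_KEY is an assumption made for this sketch.
    """
    key = env.get("OPENAI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("Set OPENAI_API_KEY before converting a PDF.")
    return key


def convert(pdf_path: str) -> str:
    """Parse one PDF with GPTPDF and return the markdown text."""
    api_key = get_api_key()       # fail early if the key is missing
    from gptpdf import parse_pdf  # requires `pip install gptpdf`
    content, _image_paths = parse_pdf(pdf_path, api_key=api_key)
    return content
```

With the key exported in your shell, calling convert with the path to your PDF returns the same markdown text that the interactive example prints.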

Create embeddings from the markdown files

Please follow the tutorial Knowledge base from a markdown file to convert your markdown file to a snapshot of embeddings that can be imported into a GaiaNet node.