Setup and Load Data into Elasticsearch
This script reads a CSV file containing documents, generates embeddings for a specified "contents" field using a Sentence Transformers model, and indexes the documents into an Elasticsearch index.
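The embed-then-index flow described above can be sketched roughly as follows. This is a minimal illustration and not the script's actual code: it assumes documents are sent through Elasticsearch's `_bulk` endpoint, and the helper names `sentence_transformer_embedder` and `build_bulk_body` as well as the `embedding` field name are hypothetical.

```python
import json
from typing import Callable, Dict, Iterable, List, Sequence

# An embedder maps a batch of texts to one vector per text.
Embedder = Callable[[Sequence[str]], List[List[float]]]


def sentence_transformer_embedder(
    model_name: str = "paraphrase-MiniLM-L6-v2",
) -> Embedder:
    """Return an embedder backed by sentence-transformers.

    The import is done lazily so the rest of this sketch can be used
    (and tested) without loading the model.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(model_name)

    def embed(texts: Sequence[str]) -> List[List[float]]:
        # encode() returns a NumPy array by default; convert it to
        # plain lists so the vectors can be serialized as JSON.
        return model.encode(list(texts)).tolist()

    return embed


def build_bulk_body(
    rows: Iterable[Dict[str, str]],
    embed: Embedder,
    index: str,
    vector_field: str = "embedding",  # hypothetical field name
) -> str:
    """Build an NDJSON body for the Elasticsearch _bulk endpoint."""
    rows = list(rows)
    vectors = embed([row["contents"] for row in rows])
    lines = []
    for row, vector in zip(rows, vectors):
        # Each document is an action line followed by its source line.
        lines.append(json.dumps({"index": {"_index": index}}))
        lines.append(json.dumps({**row, vector_field: list(vector)}))
    # The bulk API expects every line, including the last, to end in "\n".
    return "".join(line + "\n" for line in lines)
```

Under these assumptions, the resulting string could be sent with `requests.post(f"{host}/_bulk", data=body, headers={"Content-Type": "application/x-ndjson"})`, where `host` is the value of `ELASTIC_HOST`. Passing the embedder in as a function keeps the payload logic independent of the model.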
Features
- .env Configuration: Optionally reads Elasticsearch host, credentials, index name, and CSV path from a `.env` file.
- Index Management: Can optionally create a new index using a default mapping file if `CREATE_INDEX` is set to `True`. If `CREATE_INDEX` is `False`, the script verifies that the index exists.
- CSV Ingestion: Reads documents from a CSV file and verifies the existence of a `contents` column. If the column is not found, the script exits.
- Embeddings Generation: Uses a `SentenceTransformer` model (`paraphrase-MiniLM-L6-v2`) to generate 384-dimensional embeddings for each document’s contents.
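The configuration and CSV checks listed above might look something like the sketch below. It is an illustration under assumptions, not the script's real code: only `ELASTIC_HOST` and `CREATE_INDEX` are named in this README, so `ELASTIC_INDEX`, `CSV_PATH`, the default values, and the helper names are hypothetical. In the real script, calling `load_dotenv()` from `python-dotenv` first would populate `os.environ` from the `.env` file.

```python
import csv
import os
import sys
from typing import Dict, List, Mapping, Optional


def parse_bool(value: Optional[str], default: bool = False) -> bool:
    """Interpret strings such as "True" or "false" as booleans."""
    if value is None:
        return default
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, object]:
    """Collect settings from the environment (or any mapping)."""
    env = os.environ if env is None else env
    return {
        "host": env.get("ELASTIC_HOST", "http://localhost:9200"),
        # ELASTIC_INDEX and CSV_PATH are hypothetical variable names.
        "index": env.get("ELASTIC_INDEX", "documents"),
        "csv_path": env.get("CSV_PATH", "documents.csv"),
        "create_index": parse_bool(env.get("CREATE_INDEX")),
    }


def read_documents(csv_path: str) -> List[Dict[str, str]]:
    """Read all rows, exiting if the 'contents' column is missing."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        reader = csv.DictReader(handle)
        if "contents" not in (reader.fieldnames or []):
            # sys.exit with a message prints it and exits with status 1.
            sys.exit(f"Column 'contents' not found in {csv_path}")
        return list(reader)
```

Accepting any mapping in `load_settings` makes the parsing easy to exercise without touching the real environment.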
Requirements
- Python 3.8+
- `requests`
- `python-dotenv`
- `sentence-transformers`
- A running instance of Elasticsearch (e.g., `Elasticsearch 7.x+` or `Elasticsearch 8.x+`), accessible at the specified `ELASTIC_HOST`.
Setup
- Install Dependencies
Install Python dependencies using:
```bash
pip install -r requirements.txt
```
Running the script
```bash
python set_up_elasticsearch.py
```