Submitted Proposals

DEI - FCTUC
Generated on 2024-07-17 10:19:54 (Europe/Lisbon).

Internship Title

Geospatial text-to-SQL dataset to benchmark large language models

Internship Location

DEI-FCTUC

Background

Text-to-SQL parsing, which converts natural language questions into executable SQL queries, has gained increasing attention in recent years, particularly since the introduction of Large Language Models (LLMs), which have already shown promising results on this task. It is especially valuable because it lets non-technical users query relational databases without knowing a formal query language. The geospatial domain, however, has been less explored, and as a result few geospatial text-to-SQL datasets are available for training such models.

Objective

The main objectives of this dissertation are:

1) Create a comprehensive benchmark dataset for geospatial text-to-SQL parsing.
2) Evaluate various LLMs (e.g., GPT-3.5, Llama 2) on the dataset.

Work Plan - Semester 1

- Literature review on the use of LLMs to perform regular text-to-SQL and geospatial text-to-SQL, as well as methods to develop text-to-SQL datasets for benchmarking purposes;
- Define the relevant criteria for developing the geospatial text-to-SQL dataset, using the following paper as a basis: https://doi.org/10.3390/ijgi13010026;
- Write the intermediate report.

At the end of the first semester, the student should be familiar with text-to-SQL parsing and geospatial datasets and operations. The student should also be familiar with the PostgreSQL/PostGIS DBMS.

Work Plan - Semester 2

- Build the geospatial text-to-SQL dataset;
- Evaluate different LLMs with queries of varying levels of complexity using the created dataset;
- Write the dissertation and a scientific paper.
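
As a rough illustration of the build and evaluation steps above, the sketch below shows one possible shape for a benchmark entry and an order-insensitive comparison of query results. Everything in it is an assumption made for illustration: the proposal does not fix a record format or a metric, the names BenchmarkExample, results_match, and execution_accuracy are hypothetical, the schools and hospitals tables are invented, and execution accuracy (comparing the rows returned by the gold and predicted queries) is only one metric commonly used in text-to-SQL benchmarks. The result rows are hard-coded here; in practice they would come from running both queries against a PostgreSQL/PostGIS database.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkExample:
    """One hypothetical benchmark entry: a question paired with a gold SQL query."""
    question: str
    gold_sql: str
    complexity: str  # hypothetical label, e.g. "simple" or "spatial-join"


def results_match(gold_rows, predicted_rows):
    """Compare two result sets as multisets, ignoring row order."""
    return Counter(map(tuple, gold_rows)) == Counter(map(tuple, predicted_rows))


def execution_accuracy(outcomes):
    """Fraction of (gold_rows, predicted_rows) pairs whose results match."""
    if not outcomes:
        return 0.0
    hits = sum(results_match(gold, pred) for gold, pred in outcomes)
    return hits / len(outcomes)


# Illustrative entry; the table and column names are assumptions.
example = BenchmarkExample(
    question="How many schools lie within 500 m of a hospital?",
    gold_sql=(
        "SELECT COUNT(DISTINCT s.id) FROM schools AS s "
        "JOIN hospitals AS h "
        "ON ST_DWithin(s.geom::geography, h.geom::geography, 500);"
    ),
    complexity="spatial-join",
)

# Hard-coded stand-ins for rows returned by gold and LLM-generated queries.
outcomes = [
    ([(12,)], [(12,)]),                    # same single value
    ([("A",), ("B",)], [("B",), ("A",)]),  # same rows, different order
    ([(3,)], [(4,)]),                      # different value
]
score = execution_accuracy(outcomes)
print(f"Execution accuracy: {score:.2f}")
```

Comparing results as multisets keeps duplicate rows significant while ignoring row order, which suits queries without an ORDER BY clause; questions that ask for a ranking would need an order-sensitive comparison instead.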

Conditions

The student will be integrated into the Information Systems group of CISUC. A workplace and the required resources will be provided.

Remarks

To conduct this work, geographic data will be extracted from the OpenStreetMap project (i.e., an open and free geospatial database built by volunteers from around the globe). Extracts of different sizes will be produced from the database to test queries with varying levels of complexity. If necessary, census data for Portugal will be downloaded from the Pordata platform (https://www.pordata.pt/censos/resultados/emdestaque-portugal-1075).

Supervisor

Jacinto Estima
estima@dei.uc.pt