Internship Title
Content Segmentation in Digital Documents
Areas of Specialization
Intelligent Systems
Internship Location
Novabase / Center for Cognitive Computing / Coimbra
Background
The recently created Center for Cognitive Computing of Novabase, headquartered in Coimbra, is dedicated to applying the latest cognitive technologies to deliver the best solutions for the most challenging problems. With the increasing size of organizations and the complexity of their processes, we live in an era where information is both the most valuable asset of any business and its biggest challenge. The value hidden in this ever-growing ocean of information can be used to empower people, automate processes and improve experiences. Cognitive technologies are here to uncover that potential and transform businesses once again, into something that until now we could only imagine.
A considerable amount of the information produced every day in today’s organizations is stored, organized, and distributed in the form of digital documents. Being able to automatically process these documents and identify their different parts is a first step for many processes that aim to extract valuable knowledge from unstructured content. However, due to the range of formats used, the particularities of each format, and the diversity of the content produced, identifying the different blocks (text, images, tables, etc.) in a digital document such as a PDF is a challenge in itself.
The task of content segmentation in digital documents draws on a set of different approaches, from Natural Language Processing to Digital Image Processing and Deep Learning, to identify and characterize the different blocks that make up a digital document. This includes, for instance, identifying different types of text blocks (title, abstract, table of contents, paragraphs, etc.), image blocks (including image type, captions, overlaid text, etc.), and tables (including header, columns, rows, content, etc.).
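To make the notion of a content block more concrete, the following is a minimal sketch of how segmented blocks could be represented and ordered. It is illustrative only and not part of the proposal: the names BlockType, Block, blocks_of_type and reading_order are hypothetical, and the coordinate convention (origin at the top-left corner of the page, y increasing downwards) is an assumption.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List, Tuple


class BlockType(Enum):
    """Hypothetical block categories, taken from the examples in the text."""
    TITLE = "title"
    ABSTRACT = "abstract"
    TABLE_OF_CONTENTS = "table_of_contents"
    PARAGRAPH = "paragraph"
    IMAGE = "image"
    TABLE = "table"


@dataclass
class Block:
    """One segmented region of a document page (illustrative data model)."""
    block_type: BlockType
    page: int  # 1-based page number
    # Bounding box as (x0, y0, x1, y1); assumes a top-left origin.
    bbox: Tuple[float, float, float, float]
    text: str = ""
    # Free-form extras, e.g. an image caption or a table header.
    attributes: Dict[str, str] = field(default_factory=dict)


def blocks_of_type(blocks: List[Block], block_type: BlockType) -> List[Block]:
    """Return only the blocks of the requested type, preserving input order."""
    return [b for b in blocks if b.block_type is block_type]


def reading_order(blocks: List[Block]) -> List[Block]:
    """Sort blocks by page, then top-to-bottom, then left-to-right.

    This naive ordering ignores multi-column layouts.
    """
    return sorted(blocks, key=lambda b: (b.page, b.bbox[1], b.bbox[0]))


# Made-up example blocks, deliberately listed out of order.
example_blocks = [
    Block(BlockType.TABLE, 2, (50.0, 80.0, 500.0, 300.0),
          attributes={"header": "Year | Revenue"}),
    Block(BlockType.PARAGRAPH, 1, (50.0, 120.0, 500.0, 280.0), text="Body text."),
    Block(BlockType.IMAGE, 1, (50.0, 300.0, 300.0, 450.0),
          attributes={"caption": "Figure 1"}),
    Block(BlockType.TITLE, 1, (50.0, 50.0, 500.0, 90.0), text="Annual Report"),
]

ordered = reading_order(example_blocks)
tables = blocks_of_type(example_blocks, BlockType.TABLE)
```

A real framework would need a richer model (nested blocks, confidence scores, multi-column reading order), but a flat list of typed, positioned blocks is enough to illustrate what a segmenter is expected to produce.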
Objective
The main objective of this internship is the development of a framework for Content Segmentation in Digital Documents based on state-of-the-art approaches, to support the implementation of solutions that require the analysis of unstructured information in digital format (e.g. automatic question answering over financial documents).
The development of such a framework will require the completion of the following goals:
- Analysis of the state of the art, available technologies and existing competitors.
- Analysis and comparison of possible approaches.
- Implementation of a solution for content segmentation in digital documents, including support for document parsing and content segment extraction, as well as adequate APIs.
- Experimentation and fine-tuning of the implemented solution.
By the end of the internship, the intern should have gained experience in the development of solutions at an enterprise level, including processes and expected deliverables. More specifically, the intern will have acquired relevant knowledge about the design, implementation and experimentation of a framework for Content Segmentation in Digital Documents, including applicable and relevant approaches.
Work Plan - Semester 1
The main objective of this internship is the development of a framework for Content Segmentation in Digital Documents based on state-of-the-art approaches, to support the implementation of solutions that require the analysis of unstructured information in digital format (e.g. automatic question answering over financial documents).
The development of such a framework will require the completion of the following goals:
- Analysis of the state of the art, available technologies and existing competitors.
- Analysis and comparison of possible approaches.
- Implementation of a solution for content segmentation in digital documents, including support for document parsing and content segment extraction, as well as adequate APIs.
- Experimentation and fine-tuning of the implemented solution.
By the end of the internship, the intern should have gained experience in the development of solutions at an enterprise level, including processes and expected deliverables. More specifically, the intern will have acquired relevant knowledge about the design, implementation and experimentation of a framework for Content Segmentation in Digital Documents, including applicable and relevant approaches.
Work Plan - Semester 2
Semester 2 (full time, 40 h/week):
- Solution Development (Agile) [February - June]
- Solution Experimentation [June]
- Thesis Writing [June - July]
Conditions
The intern will work at the Center for Cognitive Computing of Novabase, headquartered in Coimbra, in the middle of a dynamic innovation ecosystem. The intern will be integrated in a growing team of talented people who are passionate about what they do, never run away from a good problem and are committed to delivering with the highest quality. We offer a challenging environment, the opportunity to work on large projects with international customers and a career focused on personal development.
The internship will be supervised by a senior member of the team, who will ensure a smooth integration of the intern and provide guidance whenever needed.
The intern will be invited to join the team at the end of the internship, provided that they prove to be up to the challenge and are able to deliver the expected results.
The intern will receive an internship grant.
Supervisor
Bruno Antunes
bruno.antunes@novabase.com