Data acquisition, extraction and storage

Credits: 4 ECTS
Course language: English

Teaching hours

  • Lectures (CM): 24 h
  • Total teaching hours (excluding internship): 24 h

Skills to be acquired

Understanding:
  • how to acquire data from a variety of sources and in a variety of formats
  • how to extract structured data from unstructured or semi-structured data
  • how to format, integrate, and clean data sets
  • how to store and access data sets

Course content

The objective of this course is to present the principles and techniques used to acquire, extract, integrate, clean, preprocess, store, and query data sets that can then be used as input to train various artificial intelligence models. The course will consist of a mix of lectures and practical sessions. We will cover the following topics:
  • Web data acquisition (Web crawling, Web APIs, open data, legal issues)
  • Information extraction from semi-structured data
  • Data cleaning and data deduplication
  • Data formats and data models
  • Storing and processing data in databases, in main memory, or in plain files
  • Introduction to large-scale data processing with MapReduce and Spark
  • Introduction to the management of uncertain data

Assessment

Project (50% of the grade) and in-class written assessment (50% of the grade).

Mandatory prerequisites

Basics of computer science and computer engineering (algorithms, databases, programming, logic, complexity).

Administrative contact

PIERRE SENELLART

pierre.senellart@ens.fr



Academic year 2023 - 2024 - Page last modified: 01-04-2026 (16:03) - Subject to change.