PhD Position at I3S laboratory, Sophia Antipolis, France
Robust long-term data storage in synthetic DNA
Missions:
Storage of digital data is becoming challenging for humanity due to the relatively short life span of storage devices. At the same time, the "digital universe" (all digital data worldwide) is forecast to grow to over 175 zettabytes in 2025. A significant fraction of this data is called "cold", meaning it is infrequently accessed. Old photographs stored by users on Facebook are one example of cold data; Facebook recently built an entire data center dedicated to storing such cold photographs. Unfortunately, all current storage media used for cold data storage (hard disk drives or tape) suffer from two fundamental problems. First, the rate of improvement in storage density is at best 20% per year, which substantially lags behind the 60% rate of cold data growth. Second, current storage media have a limited lifetime of five (HDD) to twenty (tape) years. As data is often stored for much longer durations (50 or more years) for legal and regulatory compliance reasons, it must be migrated to new storage devices every few years, thus increasing the cost of data ownership.
An alternative approach may stem from the use of DNA, the support of heredity in living organisms. Using DNA to store cold data is an attractive possibility because it is extremely dense, with a raw limit of 1 exabyte/mm³, and long-lasting, with an observed half-life of well over 500 years. This possibility arises from recent biotechnological developments allowing easy and affordable DNA writing (synthesis) and DNA reading (sequencing). However, one major problem of DNA storage is that the stored information is subject to errors introduced during both the synthesis and the sequencing phases. Errors take the form of substitutions, insertions and deletions of single nucleotides. The most critical phase for the introduction of errors is the sequencing of the strands: the number of sequencing errors varies significantly with the choice of sequencing machine, since different machines rely on different sequencing techniques.
The project is carried out in the context of the PEPR project MoleculArXiv (https://www.ins2i.cnrs.fr/fr/pepr-molecularxiv), which aims to develop new devices and techniques for data storage on molecular media (artificial polymers and DNA). The aim of this PhD project is to contribute to the development of the mathematical foundations for encoding and decoding information on molecular media, thereby enabling DNA as a replacement for devices like hard disks or tapes for archiving images. By relying on tools from graph theory and statistical learning, this PhD project focuses on developing new efficient techniques for data encoding/decoding in synthetic DNA. The aim is to develop novel signal processing and image compression techniques for enabling high-density storage of unstructured images in DNA. The proposed solutions should respect two main constraints: (i) the constructed DNA code should take into account chemical restrictions and (ii) the constructed DNA code should be robust to sequencing noise, i.e., errors introduced by the sequencing technology. The PhD project will build on previous work by the MediaCoding research group of the I3S laboratory [1, 2, 3, 4, 5]. By investigating machine learning techniques and prior information on the distributions of the data stored in DNA, the objective of the PhD project is to develop new robust techniques for efficiently decoding the synthetic DNA.
References:
1. Appuswamy R., Lebrigand K., Barbry P., Antonini M., Madderson O., Freemont P., MacDonald J., and Heinis T. OligoArchive: Using DNA in the DBMS storage hierarchy. In Conference on Innovative Data Systems Research (CIDR), 2019.
2. Dimopoulou M., Antonini M., Barbry P., and Appuswamy R. A biologically constrained encoding solution for long-term storage of images onto synthetic DNA. EUSIPCO, Sep. 2019, A Coruña, Spain.
3. Dimopoulou M., Antonini M., Barbry P., and Appuswamy R. Storing Digital Data into DNA: A Comparative Study of Quaternary Code Construction. ICASSP, May 2020, Barcelona, Spain.
4. Dimopoulou M. and Antonini M. Efficient Storage of Images onto DNA Using Vector Quantization. Data Compression Conference (DCC), Mar. 2020, Utah, United States.
5. Dimopoulou M. and Antonini M. Image storage in DNA using Vector Quantization. EUSIPCO, Sep. 2020, Amsterdam, The Netherlands.
6. Antonio E., Dimopoulou M., Antonini M., Barbry P., and Appuswamy R. Decoding of nanopore-sequenced synthetic DNA storing digital images. IEEE ICIP, 2021.
Activity:
The post holder will work at the I3S laboratory in the SIS/MediaCoding research group (http://mediacoding.i3s.unice.fr). Collaborations with scientists from Imperial College London are expected. The project allows for some flexibility in the profile of applicants. Candidates with expertise in the following areas would be a good fit:
- Image coding,
- Machine learning and statistical modeling,
- Graph theory.
Experience in the domain of DNA synthesis and sequencing will be appreciated.
All applicants should be able to demonstrate the following:
- A strong analytical background and research interest in topics related to signal processing, encoding/decoding, DNA synthesis and sequencing, graph and statistical learning theories,
- Solid programming skills,
- An ability to work with third-party software and to liaise constructively with the developers of such software,
- The ability to work independently and to drive both the research and software development agenda.
The successful applicant will have an MSc (or equivalent) in an area pertinent to the subject area.
Skills:
- Highly motivated and a good team player.
- Master 2 degree in signal or image processing, data sciences or in a related discipline.
- Strong development skills in Python, Matlab or C/C++.
- Curiosity, open-mindedness, creativity, persistence, professionalism, responsibility, and team spirit are the key personal qualities that we are looking for in this position.
Contact:
Interested applicants should send their resumes with copies of their academic transcripts to Marc Antonini (am@i3s.unice.fr) and Roula Nassif (roula.nassif@i3s.unice.fr).