MPEG-G
MPEG-G (ISO/IEC 23092) is an international standard, a set of rules for organizing and sharing information about genes and DNA. It was created jointly by groups that set standards for technology and for biotechnology. Its purpose is to make it easier to store, find, and protect the data produced by high-throughput sequencing machines, which can read large amounts of genetic data very quickly.
The standard has several parts, each handling something different. Some parts focus on making the data smaller to save space, while others deal with extra details (metadata) or with interfaces that programs can use to work with the data. There is also reference software that can read this data. Since 2019, both commercial products and free programs have started to use the standard, and new parts are still being added.
Background
High-throughput sequencing (HTS) technologies have changed biology. Scientists now have large amounts of information about genes, which helps in many areas, from research to medicine. This information is shared using different file types, such as FASTA and FASTQ for raw data and SAM, BAM, and CRAM for aligned data. The ISO/IEC 23092 (MPEG-G) standard aims to provide one format that works well for both storing and sending gene information. To do this, the standard is divided into several parts.
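To give a feel for the raw data mentioned above, here is a minimal Python sketch that reads one FASTQ record. It assumes the simple four-line layout (name, bases, a `+` separator, quality characters) and the common Phred+33 quality encoding. The names `FastqRecord`, `parse_fastq`, and `phred_scores` are made up for this illustration and are not part of MPEG-G or of any library.

```python
from typing import Iterator, List, NamedTuple


class FastqRecord(NamedTuple):
    """One sequencing read (hypothetical helper type for this example)."""
    name: str
    sequence: str
    quality: str


def parse_fastq(text: str) -> Iterator[FastqRecord]:
    """Yield records from FASTQ text, assuming exactly four lines per record."""
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    if len(lines) % 4 != 0:
        raise ValueError("expected four lines per FASTQ record")
    for i in range(0, len(lines), 4):
        header, sequence, separator, quality = lines[i:i + 4]
        if not header.startswith("@") or not separator.startswith("+"):
            raise ValueError(f"malformed record at non-blank line {i + 1}")
        if len(sequence) != len(quality):
            raise ValueError("sequence and quality lengths differ")
        yield FastqRecord(header[1:], sequence, quality)


def phred_scores(quality: str) -> List[int]:
    """Convert quality characters to numbers, assuming Phred+33 encoding."""
    return [ord(ch) - 33 for ch in quality]


EXAMPLE = "@read1\nGATTACA\n+\nIIIIIII\n"
records = list(parse_fastq(EXAMPLE))
print(records[0].name, records[0].sequence, phred_scores(records[0].quality))
```

Each base in the read has one quality character, which is why the two lines must have the same length.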
Structure of the standard
The MPEG-G standard uses ideas from digital media to help store and move information about genes. It is designed to handle large amounts of data, even when some parts of that data need to be kept private.
The standard is split into several parts, each with a different job:
ISO/IEC 23092-1 MPEG-G Part 1
ISO/IEC 23092-1 explains how gene data is organized for moving and storing. It defines formats for gene records, reference records, MPEG-G files, and transport streams.
ISO/IEC 23092-2 MPEG-G Part 2
ISO/IEC 23092-2 explains how to compress gene data without losing information, and also how to compress quality scores with some loss. It focuses on how to decode the data correctly.
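The difference between the two kinds of compression can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `zlib` stands in for a lossless coder, and `quantize_quality` is a made-up helper that rounds quality scores into fixed-width bins. Neither is the actual method defined in ISO/IEC 23092-2.

```python
import zlib
from typing import List


def quantize_quality(scores: List[int], bin_width: int = 10) -> List[int]:
    """Replace each score with the midpoint of its bin (lossy, illustrative)."""
    if bin_width < 1:
        raise ValueError("bin_width must be at least 1")
    return [(q // bin_width) * bin_width + bin_width // 2 for q in scores]


# Lossless: after compressing and decompressing, the bases are identical.
sequence = b"GATTACAGATTACAGATTACA"
restored = zlib.decompress(zlib.compress(sequence))
print(restored == sequence)

# Lossy: quality scores are rounded, so the exact originals cannot be recovered,
# but fewer distinct values remain, which usually makes them easier to compress.
scores = [2, 37, 40, 38, 12, 30]
quantized = quantize_quality(scores)
print(quantized)
```

With a bin width of 10, each rounded score differs from the original by at most 5, which is the kind of small, controlled loss that lossy quality-score coding accepts in exchange for smaller files.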
ISO/IEC 23092-3 MPEG-G Part 3
ISO/IEC 23092-3 explains a format for extra information and provides ways for different tools and systems to work together. It also includes a section about connecting with existing SAM content.
ISO/IEC 23092-4 MPEG-G Part 4
ISO/IEC 23092-4 describes the reference software for representing gene information. This includes both encoder and decoder software. The encoder software, called Genie, is open source and was developed by people from universities and companies around the world.
ISO/IEC 23092-5 MPEG-G Part 5
ISO/IEC 23092-5 explains how to check that devices and applications correctly follow the MPEG-G standard. It provides a way to test if everything works together properly.
| Part | Number | First public release date (First edition) | Latest public release date (edition) | Title | Description |
|---|---|---|---|---|---|
| Part 1 | ISO/IEC 23092-1 | 2019 | 2019 | Transport and Storage of Genomic Information | Specification of file format, streaming and indexing |
| Part 2 | ISO/IEC 23092-2 | 2019 | 2019 | Coding of Genomic Information | Compression of unmapped (raw) and aligned genome sequencing data |
| Part 3 | ISO/IEC 23092-3 | 2020 | 2020 | Metadata and Application Programming Interfaces (APIs) | Specification of standard interfaces, syntax for metadata and description of content protection mechanisms |
| Part 4 | ISO/IEC 23092-4 | (2020) | (2020) | Reference Software | Describes the open source implementation of a normative decoder and informative encoder. It also provides compressed bitstreams that can be used for reference purposes. Other open source implementations developed by independent groups also exist. |
| Part 5 | ISO/IEC 23092-5 | (2020) | (2020) | Conformance testing | Details the testing procedure and associated compressed reference bitstreams to be used to assess the conformance of a decoder implementation with the MPEG-G standard. |
| Part 6 | ISO/IEC 23092-6 | (2021) | (2021) | Coding of genomic annotations | Compressed representation of genomic annotations, that is, a number of heterogeneous data types associated with intervals of the reference genome that the sequencing data has been aligned to. |
| Functions Group | Brief Description |
|---|---|
| Genomic Information | Functions used to query the structure of, and retrieve, the genomic information coded in a bitstream compliant with ISO/IEC 23092 series. |
| Metadata | Functions used to query the structure of, and retrieve, the metadata associated with the coded genomic data. |
| Protection | Functions used to retrieve the protection metadata associated with the coded genomic data. |
| Reference | Functions used to retrieve the reference associated with a dataset. |
| Statistics | Functions used to retrieve statistics associated with a dataset. |
| Part | Number | Component |
|---|---|---|
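| Part 1 | ISO/IEC 23092-1 | Encapsulation |
| Part 1 | ISO/IEC 23092-1 | Indexing |
| Part 2 | ISO/IEC 23092-2 | Classification |
| Part 2 | ISO/IEC 23092-2 | Reference engine |
| Part 2 | ISO/IEC 23092-2 | Quality value quantization |
| Part 2 | ISO/IEC 23092-2 | Descriptor subsequence generation |
| Part 2 | ISO/IEC 23092-2 | Transformations |
| Part 2 | ISO/IEC 23092-2 | Entropy encoding |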
| Part 6 | ISO/IEC 23092-6 | (To be determined) |
MIME Type and Filename extensions
There is no registered internet media type (called a MIME type) for files that use the MPEG-G standard. There are also no commonly agreed filename extensions (like .mp3 for music) for these files yet.
This article is a child-friendly adaptation of the Wikipedia article on MPEG-G, available under CC BY-SA 4.0.