Industrial and academic experts predict that immersive applications will dominate the technology markets by 2020. This expectation stems from the significant growth of interest observed in recent years in the development of consumer and professional devices for both acquisition (e.g., multi-lens and multi-camera rigs, depth sensors) and rendering (e.g., mobile phones with high processing power, head-mounted displays), which are nowadays available on the market. One of the main representations extensively exploited in such applications is omnidirectional visual content, as it allows users to visualize static or dynamic scenes in a natural and immersive way. Omnidirectional, or 360-degree, visual content is essentially a collection of stitched images lying on the surface of a sphere of a given radius, with the user placed at its center, simulating the visual perception of the physical environment. Based on the head movements of the user and the viewing direction, different parts of the spherical content, called viewports, are displayed.
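To make the viewport mechanism concrete, the following sketch maps a viewing direction (yaw and pitch, as reported by a head tracker) to pixel coordinates in an equirectangular frame, the most common layout for stitched spherical content. The function name and the simple linear mapping are illustrative assumptions, not taken from any particular player implementation.

```python
def direction_to_equirect(yaw_deg, pitch_deg, frame_w, frame_h):
    """Map a viewing direction to pixel coordinates in an equirectangular
    frame. yaw_deg in [-180, 180), pitch_deg in [-90, 90] (90 = zenith)."""
    u = (yaw_deg + 180.0) / 360.0      # horizontal fraction of the frame
    v = (90.0 - pitch_deg) / 180.0     # vertical fraction (0 = top row)
    x = int(u * frame_w) % frame_w     # wrap around the horizontal seam
    y = min(int(v * frame_h), frame_h - 1)
    return x, y

# Looking straight ahead (yaw=0, pitch=0) lands at the frame center:
print(direction_to_equirect(0, 0, 11520, 6480))  # (5760, 3240)
```

A renderer would center the viewport crop on this point and re-project the surrounding region onto the display, updating it as the head tracker reports new directions.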
This advanced representation, however, comes at the expense of increased coding complexity and bandwidth requirements. For instance, in the common case of consuming 360-degree content on either a head-mounted display or a hand-held device, a viewport typically covers a viewing angle of 120 degrees horizontally. Furthermore, the screen resolution should be Ultra High Definition (UHD) in order to achieve high visual quality. This implies that, without any smart algorithm, high-quality spherical content of up to 12K (11520 × 6480) resolution should be delivered to the device. Moreover, in order to avoid simulator sickness, which is observed at low frame rates, the screen refresh rate of state-of-the-art devices should be at least 60 frames per second (fps). In the case of static scenes, a still image is delivered and, based on head movements, the screen is updated every 1/60 sec to show the corresponding viewport. In the case of dynamic content, the video sequence should also be encoded at 60 fps for higher levels of immersiveness. Thus, it is obvious that a huge amount of data is required at the client side in order to ensure a high quality of experience.
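The arithmetic behind the 12K figure, and the raw data rate it implies, can be sketched as follows. The 3× scale factor comes from the 120-degree viewport covering one third of the sphere horizontally (the text scales both dimensions by 3); the 8-bit 4:2:0 assumption for the uncompressed rate is my own illustrative choice, not stated in the text.

```python
# A 120-degree UHD viewport implies the full 360-degree sphere needs
# roughly 3x the pixels in each dimension for the same angular density.
viewport_w, viewport_h = 3840, 2160        # UHD viewport
scale = 360 // 120                         # sphere-to-viewport ratio = 3
sphere_w, sphere_h = viewport_w * scale, viewport_h * scale
print(sphere_w, sphere_h)                  # 11520 6480

# Raw (uncompressed) data rate at 60 fps, assuming 8-bit 4:2:0
# chroma subsampling (1.5 bytes per pixel on average):
fps, bytes_per_px = 60, 1.5
raw_gbps = sphere_w * sphere_h * bytes_per_px * fps * 8 / 1e9
print(round(raw_gbps, 1))                  # ~53.7 Gbit/s before compression
```

Even with aggressive video compression, rates of this magnitude motivate the viewport-aware delivery schemes discussed next.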
In media content delivery, encoding typically precedes transmission in order to reduce the amount of streamed data. The delivery of omnidirectional visual content has a peculiarity that can also be exploited to further reduce the required bandwidth: since a user can only visualize a single viewport out of the whole spherical content at any given time instance, smart delivery algorithms may be deployed. Applications based on the delivery of 360-degree video sequences in particular can benefit and achieve substantial bandwidth reductions, and this topic is therefore under extensive investigation nowadays. To date, state-of-the-art delivery algorithms for omnidirectional content are based on the principle of transmitting part of the spherical content in high quality, while the rest of the content is either not transmitted at all or transmitted in low quality. A feedback loop, or bidirectional connection, is typically established between the client and the server; the client is thus able to notify the server about the current activity of the user and receive the corresponding streams. There are numerous variations of this general scheme, with the most popular outlined below:
- Transmission of only the part of the spherical video that corresponds to the viewport the user is currently watching. In theory, if it were possible to accurately predict the exact head movements of an end-user at every future time instance, this would be the winning approach, as it enables a bandwidth reduction of at least a factor of 3. In practice, however, it is impossible to predict future user activity with high accuracy, so this variation severely impairs navigation. In particular, when the user wants to visualize another region of the content, a new viewport should be rendered immediately; however, since no other parts of the spherical content are available at the end-device, the server needs to be notified and the client needs to wait for the delivery of every new viewport.
- Division of the spherical content into independent regions, called tiles, which are transmitted as independent streams. The server offers multiple representations of each tile and the client selects the desired one, following a DASH-like structure. The goal is to encode the currently viewed tiles at the highest possible quality, while the quality of the other tiles is determined by their probability of being displayed next. When the user wants to visualize another region of the content, lower visual quality may be experienced for a short time until the system adjusts. The main drawbacks are that the client first needs to reconstruct the frame from the received tiles before extracting the viewport that corresponds to the current viewing direction, and that increasing the number of tiles reduces compression efficiency, as spatial redundancy across tile boundaries cannot be effectively exploited. This approach enables fast navigation, but the bandwidth requirements and the complexity of both the client and the server are increased.
- Delivery of the whole spherical video sequence, with each frame divided into non-overlapping regions of different quality levels. The high-quality regions correspond to potential central viewports, while the rest of the content is of low quality. Several versions of the same frame, each with a different high-quality central viewport, are provided, and each client selects the version that matches the corresponding head movements. This approach also follows a DASH-like architecture. As in tile-based content delivery, fast navigation is achieved at the expense of increased complexity and bandwidth demands, and low image quality may be visualized for short time intervals. In this case, though, the reconstruction of each frame at the client side from multiple independent streams is avoided: the client is responsible only for the selection of the next frame(s) and the extraction of the current viewport. The computational load on the server side remains high, as multiple encoding sessions of the same frame must be performed simultaneously.
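The client-side decision logic shared by the second and third variations above can be sketched as a simple probability-to-quality mapping: rank each tile (or frame version) by its likelihood of being displayed next and request a DASH representation accordingly. The function name, the probability thresholds, and the quality labels below are all illustrative assumptions, not part of any standard.

```python
def select_representations(view_prob, qualities=("low", "medium", "high")):
    """view_prob: dict mapping tile id -> probability of being displayed.
    Returns a dict mapping tile id -> requested quality level."""
    choice = {}
    for tile, p in view_prob.items():
        if p >= 0.5:        # currently viewed or very likely next viewport
            choice[tile] = qualities[2]
        elif p >= 0.1:      # plausible next viewport
            choice[tile] = qualities[1]
        else:               # unlikely to be displayed soon
            choice[tile] = qualities[0]
    return choice

# Example: one tile in view, one adjacent, one behind the user.
probs = {"t0": 0.9, "t1": 0.3, "t2": 0.05}
print(select_representations(probs))
# {'t0': 'high', 't1': 'medium', 't2': 'low'}
```

In a real system the probabilities would come from head-movement prediction or viewing statistics, and the selection would be re-evaluated at every DASH segment boundary.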
As can be seen, the selection of a content delivery scheme depends on the targeted application and on the compromises each provider is willing to make, given the aforementioned trade-offs between bandwidth savings, quality of experience (i.e., interactivity and image quality), and system complexity (i.e., resource allocation at the server and the client). This has always been the case with emerging technologies: higher demands bring higher expectations for a better quality of experience, which in turn provide opportunities to address new challenges. The consumption of omnidirectional visual content is a topic under extensive study nowadays, and it is exciting to follow these efforts, which will shape the way this type of media is created and consumed in the near future.
Best wishes from the ImmersiaTV team and Happy New Year!
MPEG Experts. Summary of survey on virtual reality (m16542). ISO/IEC JTC1/SC 29/WG 11, Oct. 2016.
Author: Evangelos Alexiou, École Polytechnique Fédérale de Lausanne.