U.S. Pat. No. 11,336,926
SYSTEM AND METHOD FOR REMOTE-HOSTED VIDEO GAME STREAMING AND FEEDBACK FROM CLIENT ON RECEIVED FRAMES
Assignee: Sony Interactive Entertainment LLC
Issue Date: August 6, 2019
Illustrative Figure
Abstract
A method and system are provided for streaming a video game from a server to a client. One example system includes the server configured to generate video frames for the video game responsive to input received from the client. An encoder processes the video frames to generate compressed video frames and stores past encoder states in memory associated with the encoder. The server is configured to transmit the compressed video frames to the client. The server is configured to receive a feedback signal from the client to determine when one or more of the compressed video frames that were sent were not received by the client. The encoder is configured to generate one or more next video frames as compressed video frames that are dependent on compressed video frames that are known to have been successfully received based on the feedback signal.
Description
DESCRIPTION OF EXAMPLE EMBODIMENTS
In the following description, specific details are set forth, such as device types, system configurations, communication methods, etc., in order to provide a thorough understanding of the present disclosure. However, persons having ordinary skill in the relevant arts will appreciate that these specific details may not be needed to practice the embodiments described.
FIGS. 2a-b provide a high-level architecture of two embodiments in which video games and software applications are hosted by a hosting service 210 and accessed by client devices 205 at user premises 211 (note that "user premises" means the place where the user is located, including outdoors if using a mobile device) over the Internet 206 (or other public or private network) under a subscription service. The client devices 205 may be general-purpose computers such as Microsoft Windows- or Linux-based PCs or Apple, Inc. Macintosh computers with a wired or wireless connection to the Internet, either with an internal or external display device 222, or they may be dedicated client devices such as a set-top box (with a wired or wireless connection to the Internet) that outputs video and audio to a monitor or TV set 222, or they may be mobile devices, presumably with a wireless connection to the Internet.
Any of these devices may have their own user input devices (e.g., keyboards, buttons, touch screens, track pads or inertial-sensing wands, video capture cameras and/or motion-tracking cameras, etc.), or they may use external input devices 221 (e.g., keyboards, mice, game controllers, inertial-sensing wands, video capture cameras and/or motion-tracking cameras, etc.), connected with wires or wirelessly. As described in greater detail below, the hosting service 210 includes servers of various levels of performance, including those with high-powered CPU/GPU processing capabilities. During playing of a game or use of an application on the hosting service 210, a home or office client device 205 receives keyboard and/or controller input from the user, and then it transmits the controller input through the Internet 206 to the hosting service 210, which executes the gaming program code in response and generates successive frames of video output (a sequence of video images) for the game or application software (e.g., if the user presses a button which would direct a character on the screen to move to the right, the game program would then create a sequence of video images showing the character moving to the right). This sequence of video images is then compressed using a low-latency video compressor, and the hosting service 210 then transmits the low-latency video stream through the Internet 206. The home or office client device then decodes the compressed video stream and renders the decompressed video images on a monitor or TV. Consequently, the computing and graphical hardware requirements of the client device 205 are significantly reduced. The client 205 only needs to have the processing power to forward the keyboard/controller input to the Internet 206 and decode and decompress a compressed video stream received from the Internet 206, which virtually any personal computer is capable of doing today in software on its CPU (e.g., an Intel Corporation Core Duo CPU running at approximately 2 GHz is capable of decompressing 720p HDTV encoded using compressors such as H.264 and Windows Media VC9). And, in the case of any client devices, dedicated chips can also perform video decompression for such standards in real time at far lower cost and with far less power consumption than a general-purpose CPU such as would be required for a modern PC. Notably, to perform the function of forwarding controller input and decompressing video, home client devices 205 do not require any specialized graphics processing units (GPUs), optical drives or hard drives, such as the prior art video game system shown in FIG. 1.
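The division of labor described above reduces the client to two tasks: forward input upstream, and decode and display video downstream. The following is a minimal sketch of that loop, not code from the patent; the server address and the input_device, decoder, and display objects are hypothetical placeholders.

```python
# Illustrative sketch (not from the patent) of the thin-client loop described
# above: forward user input upstream, decode and display video downstream.
import socket

SERVER = ("hosting.example.net", 9000)  # hypothetical hosting service 210 endpoint

def client_loop(input_device, decoder, display):
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    while True:
        # 1. Forward keyboard/controller input to the hosting service.
        event = input_device.poll()          # assumed non-blocking poll
        if event:
            sock.sendto(event.serialize(), SERVER)
        # 2. Receive a compressed video packet and decode it.
        packet, _ = sock.recvfrom(65536)
        frame = decoder.decode(packet)       # e.g., H.264/VC9 decode in software
        if frame is not None:
            display.render(frame)            # all game logic stayed on the server
```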
As games and applications software become more complex and more photo-realistic, they will require higher-performance CPUs, GPUs, more RAM, and larger and faster disk drives, and the computing power at the hosting service 210 may be continually upgraded, but the end user will not be required to update the home or office client platform 205, since its processing requirements will remain constant for a display resolution and frame rate with a given video decompression algorithm. Thus, the hardware limitations and compatibility issues seen today do not exist in the system illustrated in FIGS. 2a-b.
Further, because the game and application software executes only in servers in the hosting service 210, there never is a copy of the game or application software (either in the form of optical media, or as downloaded software) in the user's home or office ("office" as used herein, unless otherwise qualified, shall include any non-residential setting, including schoolrooms, for example). This significantly mitigates the likelihood of a game or application software being illegally copied (pirated), as well as mitigating the likelihood of a valuable database that might be used by a game or application software being pirated. Indeed, if specialized servers are required (e.g., requiring very expensive, large or noisy equipment) to play the game or application software that are not practical for home or office use, then even if a pirated copy of the game or application software were obtained, it would not be operable in the home or office.
In one embodiment, the hosting service 210 provides software development tools to the game or application software developers 220 (which refers generally to software development companies, game or movie studios, or game or application software publishers) so that they may design games capable of being executed on the hosting service 210. Such tools allow developers to exploit features of the hosting service that would not normally be available in a standalone PC or game console (e.g., fast access to very large databases of complex geometry ("geometry," unless otherwise qualified, shall be used herein to refer to polygons, textures, rigging, lighting, behaviors and other components and parameters that define 3D datasets)).
Different business models are possible under this architecture. Under one model, the hosting service 210 collects a subscription fee from the end user and pays a royalty to the developers 220, as shown in FIG. 2a. In an alternate implementation, shown in FIG. 2b, the developers 220 collect a subscription fee directly from the user and pay the hosting service 210 for hosting the game or application content. These underlying principles are not limited to any particular business model for providing online gaming or application hosting.
Compressed Video Characteristics
As discussed previously, one significant problem with providing video game services or application software services online is that of latency. A latency of 70-80 ms (from the point an input device is actuated by the user to the point where a response is displayed on the display device) is at the upper limit for games and applications requiring a fast response time. However, this is very difficult to achieve in the context of the architecture shown in FIGS. 2a and 2b due to a number of practical and physical constraints.
As indicated in FIG. 3, when a user subscribes to an Internet service, the connection is typically rated by a nominal maximum data rate 301 to the user's home or office. Depending on the provider's policies and routing equipment capabilities, that maximum data rate may be more or less strictly enforced, but typically the actual available data rate is lower for one of many different reasons. For example, there may be too much network traffic at the DSL central office or on the local cable modem loop, or there may be noise on the cabling causing dropped packets, or the provider may establish a maximum number of bits per month per user. Currently, the maximum downstream data rate for cable and DSL services typically ranges from several hundred Kilobits/second (Kbps) to 30 Mbps. Cellular services are typically limited to hundreds of Kbps of downstream data. However, the speed of broadband services and the number of users who subscribe to broadband services will increase dramatically over time. Currently, some analysts estimate that 33% of US broadband subscribers have a downstream data rate of 2 Mbps or more, and some analysts predict that by 2010, over 85% of US broadband subscribers will have a data rate of 2 Mbps or more.
As indicated in FIG. 3, the actual available max data rate 302 may fluctuate over time. Thus, in a low-latency, online gaming or application software context it is sometimes difficult to predict the actual available data rate for a particular video stream. If the data rate 303 required to sustain a given level of quality at a given number of frames-per-second (fps) at a given resolution (e.g., 640×480 @ 60 fps) for a certain amount of scene complexity and motion rises above the actual available max data rate 302 (as indicated by the peak in FIG. 3), then several problems may occur. For example, some Internet services will simply drop packets, resulting in lost data and distorted/lost images on the user's video screen. Other services will temporarily buffer (i.e., queue up) the additional packets and provide the packets to the client at the available data rate, resulting in an increase in latency, an unacceptable result for many video games and applications. Finally, some Internet service providers will view the increase in data rate as a malicious attack, such as a denial of service attack (a well-known technique used by hackers to disable network connections), and will cut off the user's Internet connection for a specified time period. Thus, the embodiments described herein take steps to ensure that the required data rate for a video game does not exceed the maximum available data rate.
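As a rough illustration of the last point, an encoder can be driven so that its target bitrate always sits safely below the measured available data rate. This sketch is an assumption about how such a constraint might be expressed, not the patent's implementation:

```python
# A minimal sketch, not from the patent: keep the encoder's required data rate
# under the measured available data rate so packets are neither dropped,
# queued, nor mistaken for a malicious attack.

def choose_target_bitrate(measured_available_bps: float,
                          safety_margin: float = 0.9) -> float:
    """Return an encoder target bitrate safely below the available rate."""
    return measured_available_bps * safety_margin

# Example: a nominal 5 Mbps link that currently delivers only 3.2 Mbps.
print(choose_target_bitrate(3_200_000))  # -> 2880000.0 bps target
```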
Hosting Service Architecture
FIG. 4a illustrates an architecture of the hosting service 210 according to one embodiment. The hosting service 210 can either be located in a single server center, or can be distributed across a plurality of server centers (to provide for lower-latency connections to users that have lower-latency paths to certain server centers than others, to provide for load balancing amongst users, and to provide for redundancy in case one or more server centers fail). The hosting service 210 may eventually include hundreds of thousands or even millions of servers 402, serving a very large user base. A hosting service control system 401 provides overall control for the hosting service 210, and directs routers, servers, video compression systems, billing and accounting systems, etc. In one embodiment, the hosting service control system 401 is implemented on a distributed processing Linux-based system tied to RAID arrays used to store the databases for user information, server information, and system statistics. In the following descriptions, the various actions implemented by the hosting service 210, unless attributed to other specific systems, are initiated and controlled by the hosting service control system 401.
The hosting service 210 includes a number of servers 402 such as those currently available from Intel, IBM, Hewlett Packard, and others. Alternatively, the servers 402 can be assembled in a custom configuration of components, or can eventually be integrated so an entire server is implemented as a single chip. Although this diagram shows a small number of servers 402 for the sake of illustration, in an actual deployment there may be as few as one server 402 or as many as millions of servers 402 or more. The servers 402 may all be configured in the same way (as an example of some of the configuration parameters: with the same CPU type and performance; with or without a GPU, and if with a GPU, with the same GPU type and performance; with the same number of CPUs and GPUs; with the same amount and type/speed of RAM; and with the same RAM configuration), or various subsets of the servers 402 may have the same configuration (e.g., 25% of the servers can be configured a certain way, 50% a different way, and 25% yet another way), or every server 402 may be different.
In one embodiment, the servers 402 are diskless; i.e., rather than having their own local mass storage (be it optical or magnetic storage, or semiconductor-based storage such as Flash memory, or other mass storage means serving a similar function), each server accesses shared mass storage through a fast backplane or network connection. In one embodiment, this fast connection is a Storage Area Network (SAN) 403 connected to a series of Redundant Arrays of Independent Disks (RAID) 405, with connections between devices implemented using Gigabit Ethernet. As is known by those of skill in the art, a SAN 403 may be used to combine many RAID arrays 405 together, resulting in extremely high bandwidth, approaching or potentially exceeding the bandwidth available from the RAM used in current gaming consoles and PCs. And, while RAID arrays based on rotating media, such as magnetic media, frequently have significant seek-time access latency, RAID arrays based on semiconductor storage can be implemented with much lower access latency. In another configuration, some or all of the servers 402 provide some or all of their own mass storage locally. For example, a server 402 may store frequently accessed information such as its operating system and a copy of a video game or application on low-latency local Flash-based storage, but it may utilize the SAN to access RAID arrays 405 based on rotating media with higher seek latency to access large databases of geometry or game state information on a less frequent basis.
In addition, in one embodiment, the hosting service 210 employs low-latency video compression logic 404, described in detail below. The video compression logic 404 may be implemented in software, hardware, or any combination thereof (certain embodiments of which are described below). Video compression logic 404 includes logic for compressing audio as well as visual material.
In operation, while playing a video game or using an application at the user premises 211 via a keyboard, mouse, game controller or other input device 421, control signal logic 413 on the client 415 transmits control signals 406a-b (typically in the form of UDP packets) representing the button presses (and other types of user inputs) actuated by the user to the hosting service 210. The control signals from a given user are routed to the appropriate server (or servers, if multiple servers are responsive to the user's input device) 402. As illustrated in FIG. 4a, control signals 406a may be routed to the servers 402 via the SAN. Alternatively or in addition, control signals 406b may be routed directly to the servers 402 over the hosting service network (e.g., an Ethernet-based local area network). Regardless of how they are transmitted, the server or servers execute the game or application software in response to the control signals 406a-b. Although not illustrated in FIG. 4a, various networking components such as firewall(s) and/or gateway(s) may process incoming and outgoing traffic at the edge of the hosting service 210 (e.g., between the hosting service 210 and the Internet 410) and/or at the edge of the user premises 211, between the Internet 410 and the home or office client 415. The graphical and audio output of the executed game or application software, i.e., new sequences of video images, are provided to the low-latency video compression logic 404, which compresses the sequences of video images according to low-latency video compression techniques, such as those described herein, and transmits a compressed video stream, typically with compressed or uncompressed audio, back to the client 415 over the Internet 410 (or, as described below, over an optimized high-speed network service that bypasses the general Internet). Low-latency video decompression logic 412 on the client 415 then decompresses the video and audio streams and renders the decompressed video stream, and typically plays the decompressed audio stream, on a display device 422. Alternatively, the audio can be played on speakers separate from the display device 422 or not at all. Note that, despite the fact that input device 421 and display device 422 are shown as free-standing devices in FIGS. 2a and 2b, they may be integrated within client devices such as portable computers or mobile devices.
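The patent specifies UDP for the control signals 406a-b but not a packet layout; the sketch below illustrates one plausible layout (user ID, sequence number, button bitmask, client timestamp), with all field names being assumptions:

```python
# Hedged sketch of control signals 406a-b as UDP packets; the field layout
# and names below are assumptions, not the patent's wire format.
import socket
import struct
import time

def send_control_packet(sock, server_addr, user_id: int, buttons: int, seq: int):
    # [user_id][sequence number][button bitmask][client timestamp]
    payload = struct.pack("!IIIQ", user_id, seq, buttons, time.monotonic_ns())
    sock.sendto(payload, server_addr)

sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
send_control_packet(sock, ("203.0.113.10", 9000), user_id=42, buttons=0b0001, seq=1)
```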
Home or office client 415 (described previously as home or office client 205 in FIGS. 2a and 2b) may be a very inexpensive and low-power device, with very limited computing or graphics performance, and may well have very limited or no local mass storage. In contrast, each server 402, coupled to a SAN 403 and multiple RAIDs 405, can be an exceptionally high performance computing system, and indeed, if multiple servers are used cooperatively in a parallel-processing configuration, there is almost no limit to the amount of computing and graphics processing power that can be brought to bear. And, because of the low-latency video compression 404 and low-latency video decompression 412, perceptually the computing power of the servers 402 is being provided to the user. When the user presses a button on input device 421, the image on display 422 is updated in response to the button press with no perceptually meaningful delay, as if the game or application software were running locally. Thus, with a home or office client 415 that is a very low performance computer or just an inexpensive chip that implements the low-latency video decompression and control signal logic 413, a user is provided with effectively arbitrary computing power from a remote location that appears to be available locally. This gives users the power to play the most advanced, processor-intensive (typically new) video games and the highest performance applications.
FIG. 4c shows a very basic and inexpensive home or office client device 465. This device is an embodiment of home or office client 415 from FIGS. 4a and 4b. It is approximately 2 inches long. It has an Ethernet jack 462 that interfaces with an Ethernet cable with Power over Ethernet (PoE), from which it derives its power and its connectivity to the Internet. It is able to run Network Address Translation (NAT) within a network that supports NAT. In an office environment, many new Ethernet switches have PoE and bring PoE directly to an Ethernet jack in an office. In such a situation, all that is required is an Ethernet cable from the wall jack to the client 465. If the available Ethernet connection does not carry power (e.g., in a home with a DSL or cable modem, but no PoE), then there are inexpensive wall "bricks" (i.e., power supplies) available that will accept an unpowered Ethernet cable and output Ethernet with PoE.
The client 465 contains control signal logic 413 (of FIG. 4a) that is coupled to a Bluetooth wireless interface, which interfaces with Bluetooth input devices 479, such as a keyboard, mouse, game controller and/or microphone and/or headset. Also, one embodiment of client 465 is capable of outputting video at 120 fps, coupled with a display device 468 able to support 120 fps video and to signal (typically through infrared) a pair of shuttered glasses 466 to alternately shutter one eye, then the other, with each successive frame. The effect perceived by the user is that of a stereoscopic 3D image that "jumps out" of the display screen. One such display device 468 that supports such operation is the Samsung HL-T5076S. Since the video stream for each eye is separate, in one embodiment two independent video streams are compressed by the hosting service 210, the frames are interleaved in time, and the frames are decompressed as two independent decompression processes within client 465.
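The time-interleaving of the two eye streams can be pictured with a short sketch. This is an illustration only; the patent does not specify an even/odd frame ordering:

```python
# Sketch (assumption, not the patent's code) of de-interleaving a 120 fps
# stream into left/right eye sequences for two independent decoders.

def deinterleave_stereo(frames):
    """Even-indexed frames -> left eye, odd-indexed -> right eye."""
    left = frames[0::2]    # decoded by decompressor instance #1
    right = frames[1::2]   # decoded by decompressor instance #2
    return left, right

left, right = deinterleave_stereo(["L0", "R0", "L1", "R1", "L2", "R2"])
print(left, right)  # ['L0', 'L1', 'L2'] ['R0', 'R1', 'R2']
```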
The client 465 also contains low-latency video decompression logic 412, which decompresses the incoming video and audio and outputs them through the HDMI (High-Definition Multimedia Interface) connector 463, which plugs into an SDTV (Standard Definition Television) or HDTV (High Definition Television) 468, providing the TV with video and audio, or into a monitor 468 that supports HDMI. If the user's monitor 468 does not support HDMI, then an HDMI-to-DVI (Digital Visual Interface) adapter can be used, but the audio will be lost. Under the HDMI standard, the display capabilities (e.g., supported resolutions, frame rates) 464 are communicated from the display device 468, and this information is then passed back through the Internet connection 462 to the hosting service 210, so it can stream compressed video in a format suitable for the display device.
FIG. 4d shows a home or office client device 475 that is the same as the home or office client device 465 shown in FIG. 4c except that it has more external interfaces. Also, client 475 can accept either PoE for power, or it can run off of an external power supply adapter (not shown) that plugs into the wall. Using the client 475 USB input, video camera 477 provides compressed video to client 475, which is uploaded by client 475 to hosting service 210 for the uses described below. Built into camera 477 is a low-latency compressor utilizing the compression techniques described below.
In addition to having an Ethernet connector for its Internet connection, client 475 also has an 802.11g wireless interface to the Internet. Both interfaces are able to use NAT within a network that supports NAT.
Also, in addition to having an HDMI connector to output video and audio, client 475 also has a Dual Link DVI-I connector, which includes analog output (and with a standard adapter cable will provide VGA output). It also has analog outputs for composite video and S-video.
For audio, the client 475 has left/right analog stereo RCA jacks, and for digital audio output it has a TOSLINK output.
In addition to a Bluetooth wireless interface to input devices 479, it also has USB jacks to interface to input devices.
FIG. 4e shows one embodiment of the internal architecture of client 465. Either all or some of the devices shown in the diagram can be implemented in a Field Programmable Gate Array (FPGA), a custom ASIC, or in several discrete devices, either custom designed or off-the-shelf.
Ethernet with PoE 497 attaches to Ethernet Interface 481. Power 499 is derived from the Ethernet with PoE 497 and is connected to the rest of the devices in the client 465. Bus 480 is a common bus for communication between devices.
Control CPU 483 (almost any small CPU, such as a MIPS R4000 series CPU at 100 MHz with embedded RAM, is adequate) running a small client control application from Flash 476 implements the protocol stack for the network (i.e., the Ethernet interface), communicates with the hosting service 210, and configures all of the devices in the client 465. It also handles interfaces with the input devices 469 and sends packets back to the hosting service 210 with user controller data, protected by Forward Error Correction, if necessary. Also, Control CPU 483 monitors the packet traffic (e.g., whether packets are lost or delayed) and timestamps their arrival. This information is sent back to the hosting service 210 so that it can constantly monitor the network connection and adjust what it sends accordingly. Flash memory 476 is initially loaded at the time of manufacture with the control program for Control CPU 483 and also with a serial number that is unique to the particular client 465 unit. This serial number allows the hosting service 210 to uniquely identify the client 465 unit.
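A minimal sketch of that monitoring loop follows. It is illustrative only: the patent describes tracking lost/delayed packets and timestamping arrivals, but the class and field names here are assumptions.

```python
# Illustrative sketch of the feedback described above: the control CPU 483
# tracks gaps in sequence numbers and arrival times so the hosting service
# can adapt what it sends. Field layout and names are assumptions.
import time

class PacketMonitor:
    def __init__(self):
        self.expected_seq = 0
        self.lost = 0
        self.arrivals = []          # (seq, arrival timestamp) pairs

    def on_packet(self, seq: int):
        if seq > self.expected_seq:         # gap -> packets lost or reordered
            self.lost += seq - self.expected_seq
        self.expected_seq = max(self.expected_seq, seq + 1)
        self.arrivals.append((seq, time.monotonic()))

    def feedback(self) -> dict:
        # Sent back to the hosting service (e.g., with FEC protection).
        return {"lost": self.lost, "last_seq": self.expected_seq - 1}
```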
Bluetooth interface 484 communicates with input devices 469 wirelessly through its antenna, internal to client 465.
Video decompressor 486 is a low-latency video decompressor configured to implement the video decompression described herein. A large number of video decompression devices exist, either off-the-shelf, or as Intellectual Property (IP) of a design that can be integrated into an FPGA or a custom ASIC. One company offering IP for an H.264 decoder is Ocean Logic of Manly, NSW, Australia. The advantage of using IP is that the compression techniques used herein do not conform to compression standards. Some standard decompressors are flexible enough to be configured to accommodate the compression techniques herein, but some cannot. But, with IP, there is complete flexibility in redesigning the decompressor as needed.
The output of the video decompressor is coupled to the video output subsystem 487, which couples the video to the video output of the HDMI interface 490.
The audio decompression subsystem 488 is implemented either using a standard audio decompressor that is available, or it can be implemented as IP, or the audio decompression can be implemented within the control processor 483, which could, for example, implement the Vorbis audio decompressor (available at Vorbis.com).
The device that implements the audio decompression is coupled to the audio output subsystem 489, which couples the audio to the audio output of the HDMI interface 490.
FIG. 4f shows one embodiment of the internal architecture of client 475. As can be seen, the architecture is the same as that of client 465 except for additional interfaces and optional external DC power from a power supply adapter that plugs into the wall and, if so used, replaces power that would come from the Ethernet PoE 497. The functionality that is in common with client 465 will not be repeated below, but the additional functionality is described as follows.
CPU 483 communicates with and configures the additional devices.
WiFi subsystem 482 provides wireless Internet access, as an alternative to Ethernet 497, through its antenna. WiFi subsystems are available from a wide range of manufacturers, including Atheros Communications of Santa Clara, Calif.
USB subsystem 485 provides an alternative to Bluetooth communication for wired USB input devices 479. USB subsystems are quite standard and readily available for FPGAs and ASICs, as well as frequently built into off-the-shelf devices performing other functions, like video decompression.
Video output subsystem 487 produces a wider range of video outputs than within client 465. In addition to providing HDMI 490 video output, it provides DVI-I 491, S-video 492, and composite video 493. Also, when the DVI-I 491 interface is used for digital video, display capabilities 464 are passed back from the display device to the control CPU 483 so that it can notify the hosting service 210 of the display device 478 capabilities. All of the interfaces provided by the video output subsystem 487 are quite standard interfaces and readily available in many forms.
Audio output subsystem 489 outputs audio digitally through digital interface 494 (S/PDIF and/or Toslink) and audio in analog form through stereo analog interface 495.
Round-Trip Latency Analysis
Of course, for the benefits of the preceding paragraph to be realized, the round-trip latency between a user's action using input device 421 and seeing the consequence of that action on display device 422 should be no more than 70-80 ms. This latency must take into account all of the factors in the path from input device 421 in the user premises 211 to hosting service 210 and back again to the user premises 211 to display device 422. FIG. 4b illustrates the various components and networks over which signals must travel, and above these components and networks is a timeline that lists exemplary latencies that can be expected in a practical implementation. Note that FIG. 4b is simplified so that only the critical path routing is shown. Other routing of data used for other features of the system is described below. Double-headed arrows (e.g., arrow 453) indicate round-trip latency, single-headed arrows (e.g., arrow 457) indicate one-way latency, and "˜" denotes an approximate measure. It should be pointed out that there will be real-world situations where the latencies listed cannot be achieved, but in a large number of cases in the US, using DSL and cable modem connections to the user premises 211, these latencies can be achieved in the circumstances described in the next paragraph. Also, note that, while cellular wireless connectivity to the Internet will certainly work in the system shown, most current US cellular data systems (such as EVDO) incur very high latencies and would not be able to achieve the latencies shown in FIG. 4b. However, these underlying principles may be implemented on future cellular technologies that may be capable of implementing this level of latency. Further, there are game and application scenarios (e.g., games that do not require fast user reaction time, such as chess) where the latency incurred through a current US cellular data system, while noticeable to the user, would be acceptable for the game or application.
Starting from the input device 421 at user premises 211, once the user actuates the input device 421, a user control signal is sent to client 415 (which may be a standalone device such as a set-top box, or it may be software or hardware running in another device such as a PC or a mobile device), and is packetized (in UDP format in one embodiment), and the packet is given a destination address to reach hosting service 210. The packet will also contain information to indicate which user the control signals are coming from. The control signal packet(s) are then forwarded through Firewall/Router/NAT (Network Address Translation) device 443 to WAN interface 442. WAN interface 442 is the interface device provided to the user premises 211 by the user's ISP (Internet Service Provider). The WAN interface 442 may be a cable or DSL modem, a WiMax transceiver, a fiber transceiver, a cellular data interface, an Internet-Protocol-over-powerline interface, or any other of many interfaces to the Internet. Further, Firewall/Router/NAT device 443 (and potentially WAN interface 442) may be integrated into the client 415. An example of this would be a mobile phone, which includes software to implement the functionality of home or office client 415, as well as the means to route and connect to the Internet wirelessly through some standard (e.g., 802.11g).
WAN interface 442 then routes the control signals to what shall be called herein the "point of presence" 441 for the user's Internet Service Provider (ISP), which is the facility that provides an interface between the WAN transport connected to the user premises 211 and the general Internet or private networks. The point of presence's characteristics will vary depending upon the nature of the Internet service provided. For DSL, it typically will be a telephone company Central Office where a DSLAM is located. For cable modems, it typically will be a cable Multi-System Operator (MSO) head end. For cellular systems, it typically will be a control room associated with a cellular tower. But whatever the point of presence's nature, it will then route the control signal packet(s) to the general Internet 410. The control signal packet(s) will then be routed to the WAN interface 441 to the hosting service 210, through what most likely will be a fiber transceiver interface. The WAN interface 441 will then route the control signal packets to routing logic 409 (which may be implemented in many different ways, including Ethernet switches and routing servers), which evaluates the user's address and routes the control signal(s) to the correct server 402 for the given user.
The server 402 then takes the control signals as input for the game or application software that is running on the server 402 and uses the control signals to process the next frame of the game or application. Once the next frame is generated, the video and audio are output from server 402 to video compressor 404. The video and audio may be output from server 402 to compressor 404 through various means. To start with, compressor 404 may be built into server 402, so the compression may be implemented locally within server 402. Or, the video and/or audio may be output in packetized form through a network connection, such as an Ethernet connection, to a network that is either a private network between server 402 and video compressor 404, or through a shared network such as SAN 403. Or, the video may be output through a video output connector from server 402, such as a DVI or VGA connector, and then captured by video compressor 404. Also, the audio may be output from server 402 as either digital audio (e.g., through a TOSLINK or S/PDIF connector) or as analog audio, which is digitized and encoded by audio compression logic within video compressor 404.
Once video compressor 404 has captured the video frame and the audio generated during that frame time from server 402, the video compressor will compress the video and audio using techniques described below. Once the video and audio are compressed, they are packetized with an address to send them back to the user's client 415, and they are routed to the WAN interface 441, which then routes the video and audio packets through the general Internet 410, which then routes the video and audio packets to the user's ISP point of presence 441, which routes the video and audio packets to the WAN interface 442 at the user's premises, which routes the video and audio packets to the Firewall/Router/NAT device 443, which then routes the video and audio packets to the client 415.
The client 415 decompresses the video and audio, and then displays the video on the display device 422 (or the client's built-in display device) and sends the audio to the display device 422, or to separate amplifier/speakers, or to an amplifier/speakers built into the client.
For the user to perceive that the entire process just described is without lag, the round-trip delay needs to be less than 70 or 80 ms. Some of the latency delays in the described round-trip path are under the control of the hosting service 210 and/or the user, and others are not. Nonetheless, based on analysis and testing of a large number of real-world scenarios, the following are approximate measurements.
The one-way transmission time to send the control signals 451 is typically less than 1 ms, and the round-trip routing through the user premises 452 is typically accomplished using readily available consumer-grade Firewall/Router/NAT switches over Ethernet in about 1 ms. User ISPs vary widely in their round-trip delays 453, but with DSL and cable modem providers, we typically see between 10 and 25 ms. The round-trip latency on the general Internet 410 can vary greatly depending on how traffic is routed and whether there are any failures on the route (these issues are discussed below), but typically the general Internet provides fairly optimal routes and the latency is largely determined by the speed of light through optical fiber, given the distance to the destination. As discussed further below, we have established 1000 miles as roughly the furthest distance that we expect to place a hosting service 210 away from user premises 211. At 1000 miles (2000 miles round trip) the practical transit time for a signal through the Internet is approximately 22 ms. The WAN interface 441 to the hosting service 210 is typically a commercial-grade fiber high-speed interface with negligible latency. Thus, the general Internet latency 454 is typically between 1 and 10 ms. The one-way routing 455 latency through the hosting service 210 can be achieved in less than 1 ms. The server 402 will typically compute a new frame for a game or an application in less than one frame time (which at 60 fps is 16.7 ms), so 16 ms is a reasonable maximum one-way latency 456 to use. In an optimized hardware implementation of the video compression and audio compression algorithms described herein, the compression 457 can be completed in 1 ms. In less optimized versions, the compression may take as much as 6 ms (of course, even less optimized versions could take longer, but such implementations would impact the overall latency of the round trip and would require other latencies to be shorter (e.g., the allowable distance through the general Internet could be reduced) to maintain the 70-80 ms latency target). The round-trip latencies of the Internet 454, user ISP 453, and user premises routing 452 have already been considered, so what remains is the video decompression 458 latency, which, depending on whether the video decompression 458 is implemented in dedicated hardware or in software on a client device 415 (such as a PC or mobile device), can vary with the size of the display and the performance of the decompressing CPU. Typically, decompression 458 takes between 1 and 8 ms.
Thus, by adding together all of the worst-case latencies seen in practice, we can determine the worst-case round-trip latency that can be expected to be experienced by a user of the system shown in FIG. 4a: 1+1+25+22+1+16+6+8=80 ms. And, indeed, in practice (with caveats discussed below), this is roughly the round-trip latency seen using prototype versions of the system shown in FIG. 4a, using off-the-shelf Windows PCs as client devices and home DSL and cable modem connections within the US. Of course, scenarios better than worst case can result in much shorter latencies, but they cannot be relied upon in developing a commercial service that is used widely.
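The worst-case budget quoted above can be checked directly (values as listed in FIG. 4b; the dictionary below merely restates them):

```python
# Worked check of the worst-case latency budget quoted above.
budget_ms = {
    "control signal one-way (451)":  1,
    "user premises routing (452)":   1,
    "user ISP round trip (453)":    25,
    "general Internet (454)":       22,
    "hosting service routing (455)": 1,
    "frame computation (456)":      16,
    "video compression (457)":       6,
    "video decompression (458)":     8,
}
print(sum(budget_ms.values()))  # 80 ms -> at the edge of the 70-80 ms target
```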
To achieve the latencies listed in FIG. 4b over the general Internet requires the video compressor 404 and the video decompressor 412 from FIG. 4a in the client 415 to generate a packet stream with very particular characteristics, such that the packet sequence generated through the entire path from the hosting service 210 to the display device 422 is not subject to delays or excessive packet loss and, in particular, consistently falls within the constraints of the bandwidth available to the user over the user's Internet connection through WAN interface 442 and Firewall/Router/NAT 443. Further, the video compressor must create a packet stream which is sufficiently robust that it can tolerate the inevitable packet loss and packet reordering that occur in normal Internet and network transmissions.
Low-Latency Video Compression
To accomplish the foregoing goals, one embodiment takes a new approach to video compression which decreases the latency and the peak bandwidth requirements for transmitting video. Prior to the description of these embodiments, an analysis of current video compression techniques will be provided with respect to FIG. 5 and FIGS. 6a-b. Of course, these techniques may be employed in accordance with these underlying principles if the user is provided with sufficient bandwidth to handle the data rate required by these techniques. Note that audio compression is not addressed herein other than to state that it is implemented simultaneously and in synchrony with the video compression. Prior art audio compression techniques exist that satisfy the requirements for this system.
FIG. 5 illustrates one particular prior art technique for compressing video in which each individual video frame 501-503 is compressed by compression logic 520 using a particular compression algorithm to generate a series of compressed frames 511-513. One embodiment of this technique is "motion JPEG," in which each frame is compressed according to a Joint Photographic Experts Group (JPEG) compression algorithm based upon the discrete cosine transform (DCT). Various different types of compression algorithms may be employed, however, while still complying with these underlying principles (e.g., wavelet-based compression algorithms such as JPEG-2000).
One problem with this type of compression is that it reduces the data rate of each frame, but it does not exploit similarities between successive frames to reduce the data rate of the overall video stream. For example, as illustrated in FIG. 5, assuming frames of 640×480 pixels at 24 bits/pixel, each uncompressed frame is 640*480*24/8/1024 = 900 Kilobytes/frame (KB/frame). For a given quality of image, motion JPEG may only compress the stream by a factor of 10, resulting in a data stream of 90 KB/frame. At 60 frames/sec, this would require a channel bandwidth of 90 KB*8 bits*60 frames/sec = 42.2 Mbps, which would be far too high a bandwidth for almost all home Internet connections in the US today, and too high a bandwidth for many office Internet connections. Indeed, given that it would demand a constant data stream at such a high bandwidth, and it would be serving just one user, even in an office LAN environment it would consume a large percentage of a 100 Mbps Ethernet LAN's bandwidth and heavily burden the Ethernet switches supporting the LAN. Thus, compression for motion video is inefficient when compared with other compression techniques (such as those described below). Moreover, single-frame compression algorithms like JPEG and JPEG-2000 that use lossy compression algorithms produce compression artifacts that may not be noticeable in still images (e.g., an artifact within dense foliage in the scene may not appear as an artifact since the eye does not know exactly how the dense foliage should appear). But, once the scene is in motion, an artifact can stand out because the eye detects that the artifact changed from frame to frame, despite the fact that the artifact is in an area of the scene where it might not have been noticeable in a still image. This results in the perception of "background noise" in the sequence of frames, similar in appearance to the "snow" noise visible during marginal analog TV reception. Of course, this type of compression may still be used in certain embodiments described herein, but generally speaking, to avoid background noise in the scene, a high data rate (i.e., a low compression ratio) is required for a given perceptual quality.
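The arithmetic above is easy to verify (using 1 KB = 1024 bytes and 1 Mbps = 1024×1024 bits/second, which is how the document's figures work out):

```python
# Reproducing the motion-JPEG bandwidth arithmetic quoted above.
uncompressed_kb = 640 * 480 * 24 / 8 / 1024           # 900.0 KB/frame
compressed_kb = uncompressed_kb / 10                   # 90.0 KB/frame at 10:1
mbps = compressed_kb * 1024 * 8 * 60 / (1024 * 1024)   # 60 frames/sec
print(uncompressed_kb, compressed_kb, round(mbps, 1))  # 900.0 90.0 42.2
```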
Other types of compression, such as H.264, Windows Media VC9, MPEG2 and MPEG4, are all more efficient at compressing a video stream because they exploit the similarities between successive frames. These techniques all rely upon the same general techniques to compress video. Thus, although the H.264 standard will be described, the same general principles apply to various other compression algorithms. A large number of H.264 compressors and decompressors are available, including the x264 open source software library for compressing H.264 and the FFmpeg open source software libraries for decompressing H.264.
FIGS. 6a and 6b illustrate an exemplary prior art compression technique in which a series of uncompressed video frames 501-503, 559-561 are compressed by compression logic 620 into a series of "I frames" 611, 671; "P frames" 612-613; and "B frames" 670. The vertical axis in FIG. 6a generally signifies the resulting size of each of the encoded frames (although the frames are not drawn to scale). As described above, video coding using I frames, B frames and P frames is well understood by those of skill in the art. Briefly, an I frame 611 is a DCT-based compression of a complete uncompressed frame 501 (similar to a compressed JPEG image as described above). P frames 612-613 generally are significantly smaller in size than I frames 611 because they take advantage of the data in the previous I frame or P frame; that is, they contain data indicating the changes between themselves and the previous I frame or P frame. B frames 670 are similar to P frames, except that B frames use the following reference frame as well as, potentially, the preceding reference frame.
For the following discussion, it will be assumed that the desired frame rate is 60 frames/second, that each I frame is approximately 160 Kb, that the average P frame and B frame is 16 Kb, and that a new I frame is generated every second. With this set of parameters, the average data rate would be: 160 Kb+16 Kb*59=1.1 Mbps. This data rate falls well within the maximum data rate for many current broadband Internet connections to homes and offices. This technique also tends to avoid the background noise problem from intraframe-only encoding, because the P and B frames track differences between the frames, so compression artifacts tend not to appear and disappear from frame to frame, thereby reducing the background noise problem described above.
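Checking the average-rate arithmetic above:

```python
# One 160 Kb I frame plus 59 frames averaging 16 Kb, per second of 60 fps video.
avg_kb_per_sec = 160 + 16 * 59          # 1104 Kb/s
print(round(avg_kb_per_sec / 1024, 2))  # ~1.08 Mbps, the ~1.1 Mbps quoted above
```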
One problem with the foregoing types of compression is that although the average data rate is relatively low (e.g., 1.1 Mbps), a single I frame may take several frame times to transmit. For example, using prior art techniques, a 2.2 Mbps network connection (e.g., DSL or cable modem with a 2.2 Mbps peak max available data rate 302 from FIG. 3a) would typically be adequate to stream video at 1.1 Mbps with a 160 Kb I frame every 60 frames. This would be accomplished by having the decompressor queue up 1 second of video before decompressing the video. In 1 second, 1.1 Mb of data would be transmitted, which would be easily accommodated by a 2.2 Mbps max available data rate, even assuming that the available data rate might dip periodically by as much as 50%. Unfortunately, this prior art approach would result in a 1-second latency for the video because of the 1-second video buffer at the receiver. Such a delay is adequate for many prior art applications (e.g., the playback of linear video), but is far too long a latency for fast-action video games, which cannot tolerate more than 70-80 ms of latency.
If an attempt were made to eliminate the 1-second video buffer, it still would not result in an adequate reduction in latency for fast-action video games. For one, the use of B frames, as previously described, would necessitate the reception of all of the B frames preceding an I frame as well as the I frame. If we assume the 59 non-I frames are roughly split between P and B frames, then there would be at least 29 B frames and an I frame received before any B frame could be displayed. Thus, regardless of the available bandwidth of the channel, it would necessitate a delay of 29+1=30 frames of 1/60th second duration each, or 500 ms of latency. Clearly that is far too long.
Thus, another approach would be to eliminate B frames and only use I and P frames. (One consequence of this is that the data rate would increase for a given quality level, but for the sake of consistency in this example, let's continue to assume that each I frame is 160 Kb and the average P frame is 16 Kb in size, and thus the data rate is still 1.1 Mbps.) This approach eliminates the unavoidable latency introduced by B frames, since the decoding of each P frame relies only on the prior received frame. A problem that remains with this approach is that an I frame is so much larger than an average P frame that, on a low-bandwidth channel, as is typical in most homes and in many offices, the transmission of the I frame adds substantial latency. This is illustrated in FIG. 6b. The video stream data rate 624 is below the available max data rate 622 except for the I frames, where the peak data rate required for the I frames 623 far exceeds the available max data rate 622 (and even the rated max data rate 621). The data rate required by the P frames is less than the available max data rate. Even if the available max data rate remains steadily at its 2.2 Mbps peak rate, it will take 160 Kb/2.2 Mb=71 ms to transmit the I frame, and if the available max data rate 622 dips by 50% (to 1.1 Mbps), it will take 142 ms to transmit the I frame. So, the latency in transmitting the I frame will fall somewhere between 71 and 142 ms. This latency is additive to the latencies identified in FIG. 4b, which in the worst case added up to 70 ms, so this would result in a total round-trip latency of 141-222 ms from the point the user actuates input device 421 until an image appears on display device 422, which is far too high. And if the available max data rate dips below 2.2 Mbps, the latency will increase further.
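The I frame transmission times quoted above follow directly (again using 1024-based units, which reproduces the 71 ms and 142 ms figures):

```python
# Transmit time for a 160 Kb I frame on the channel described above.
i_frame_bits = 160 * 1024
for rate_mbps in (2.2, 1.1):
    rate_bps = rate_mbps * 1024 * 1024
    print(round(i_frame_bits / rate_bps * 1000))  # -> 71 ms, then 142 ms
```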
Note also that there generally are severe consequences to "jamming" an ISP with a peak data rate 623 that is far in excess of the available data rate 622. The equipment in different ISPs will behave differently, but the following behaviors are quite common among DSL and cable modem ISPs when receiving packets at a much higher data rate than the available data rate 622: (a) delaying the packets by queuing them (introducing latency), (b) dropping some or all of the packets, (c) disabling the connection for a period of time (most likely because the ISP is concerned that it is a malicious attack, such as a "denial of service" attack). Thus, transmitting a packet stream at full data rate with characteristics such as those shown in FIG. 6b is not a viable option. The peaks 623 may be queued up at the hosting service 210 and sent at a data rate below the available max data rate, but this introduces the unacceptable latency described in the preceding paragraph.
Further, the video stream data rate sequence 624 shown in FIG. 6b is a very "tame" video stream data rate sequence and would be the sort of data rate sequence that one would expect to result from compressing video from a video sequence that does not change very much and has very little motion (e.g., as would be common in video teleconferencing, where the cameras are in a fixed position and have little motion, and the objects in the scene, e.g., seated people talking, show little motion).
The video stream data rate sequence 634 shown in FIG. 6c is a sequence typical of what one would expect to see from video with far more action, such as might be generated in a motion picture or a video game, or in some application software. Note that in addition to the I frame peaks 633, there are also P frame peaks, such as 635 and 636, that are quite large and exceed the available max data rate on many occasions. Although these P frame peaks are not quite as large as the I frame peaks, they are still far too large to be carried by the channel at full data rate, and as with the I frame peaks, the P frame peaks must be transmitted slowly (thereby increasing latency).
On a high-bandwidth channel (e.g., a 100 Mbps LAN, or a high-bandwidth 100 Mbps private connection) the network would be able to tolerate large peaks, such as I frame peaks 633 or P frame peaks 636, and in principle, low latency could be maintained. But such networks are frequently shared amongst many users (e.g., in an office environment), and such "peaky" data would impact the performance of the LAN, particularly if the network traffic were routed to a private shared connection (e.g., from a remote data center to an office). To start with, bear in mind that this example is of a relatively low-resolution video stream of 640×480 pixels at 60 fps. HDTV streams of 1920×1080 at 60 fps are readily handled by modern computers and displays, and 2560×1440 resolution displays at 60 fps are increasingly available (e.g., Apple, Inc.'s 30″ display). A high-action video sequence at 1920×1080 at 60 fps may require 4.5 Mbps using H.264 compression for a reasonable quality level. If we assume the I frames peak at 10× the nominal data rate, that would result in 45 Mbps peaks, as well as smaller, but still considerable, P frame peaks. If several users were receiving video streams on the same 100 Mbps network (e.g., a private network connection between an office and a data center), it is easy to see how the peaks from several users' video streams could happen to align, overwhelming the bandwidth of the network, and potentially overwhelming the bandwidth of the backplanes of the switches supporting the users on the network. Even in the case of a Gigabit Ethernet network, if enough users had enough peaks aligned at once, it could overwhelm the network or the network switches. And, once 2560×1440 resolution video becomes more commonplace, the average video stream data rate may be 9.5 Mbps, resulting in perhaps a 95 Mbps peak data rate. Needless to say, a 100 Mbps connection between a data center and an office (which today is an exceptionally fast connection) would be completely swamped by the peak traffic from a single user. Thus, even though LANs and private network connections can be more tolerant of peaky streaming video, streaming video with high peaks is not desirable and might require special planning and accommodation by an office's IT department.
Of course, for standard linear video applications these issues are not a problem, because the data rate is "smoothed" at the point of transmission, the data for each frame is sent below the max available data rate 622, and a buffer in the client stores a sequence of I, P and B frames before they are decompressed. Thus, the data rate over the network remains close to the average data rate of the video stream. Unfortunately, this introduces latency, even if B frames are not used, that is unacceptable for low-latency applications such as video games and applications requiring fast response times.
One prior art solution to mitigating video streams that have high peaks is to use a technique often referred to as "Constant Bit Rate" (CBR) encoding. Although the term CBR would seem to imply that all frames are compressed to have the same bit rate (i.e., size), what it usually refers to is a compression paradigm where a maximum bit rate across a certain number of frames (in our case, 1 frame) is allowed. For example, in the case of FIG. 6c, if a CBR constraint were applied to the encoding that limited the bit rate to, for example, 70% of the rated max data rate 621, then the compression algorithm would limit the compression of each of the frames so that any frame that would normally be compressed using more than 70% of the rated max data rate 621 would be compressed with fewer bits. The result of this is that frames that would normally require more bits to maintain a given quality level would be "starved" of bits, and the image quality of those frames would be worse than that of other frames that do not require more bits than 70% of the rated max data rate 621. This approach can produce acceptable results for certain types of compressed video where (a) little motion or few scene changes are expected and (b) the users can accept periodic quality degradation. A good example of a CBR-suited application is video teleconferencing, since there are few peaks, and if the quality degrades briefly (for example, if the camera is panned, resulting in significant scene motion and large peaks, there may not be enough bits during the panning for high-quality image compression, which would result in degraded image quality), it is acceptable for most users. Unfortunately, CBR is not well-suited for many other applications which have scenes of high complexity or a great deal of motion and/or where a reasonably constant level of quality is required.
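A minimal sketch of the CBR paradigm described above follows, assuming a per-frame cap of 70% of the rated rate divided across 60 fps; this illustrates the idea, not any particular encoder's implementation:

```python
# Minimal sketch (an assumption, not the patent's encoder) of the CBR
# paradigm: cap each frame's bit budget at a fraction of the rated rate.

def cbr_frame_budget(rated_max_bps: float, fps: int = 60,
                     cap_fraction: float = 0.70) -> float:
    """Max bits any single frame may use under the CBR constraint."""
    return rated_max_bps * cap_fraction / fps

def encode_with_cbr(frame_sizes_bits, budget_bits):
    # Frames needing more than the budget are 'starved' down to it,
    # trading image quality for a bounded per-frame bit rate.
    return [min(size, budget_bits) for size in frame_sizes_bits]

budget = cbr_frame_budget(2_200_000)               # ~25,667 bits per frame
print(encode_with_cbr([20_000, 160_000], budget))  # second frame gets starved
```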
The low-latency compression logic404employed in one embodiment uses several different techniques to address the range of problems with streaming low-latency compressed video, while maintaining high quality. First, the low-latency compression logic404generates only I frames and P frames, thereby alleviating the need to wait several frame times to decode each B frame. In addition, as illustrated inFIG. 7a, in one embodiment, the low-latency compression logic404subdivides each uncompressed frame701-760into a series of “tiles” and individually encodes each tile as either an I frame or a P frame. The group of compressed I frames and P frames are referred to herein as “R frames”711-770. In the specific example shown inFIG. 7a, each uncompressed frame is subdivided into a 4×4 matrix of 16 tiles. However, these underlying principles are not limited to any particular subdivision scheme.
In one embodiment, the low-latency compression logic404divides up the video frame into a number of tiles, and encodes (i.e., compresses) one tile from each frame as an I frame (i.e., the tile is compressed as if it is a separate video frame of 1/16ththe size of the full image, and the compression used for this "mini" frame is I frame compression) and the remaining tiles as P frames (i.e., the compression used for each "mini" 1/16thframe is P frame compression). Tiles compressed as I frames and as P frames shall be referred to as "I tiles" and "P tiles", respectively. With each successive video frame, the tile to be encoded as an I tile is changed. Thus, in a given frame time, only one tile of the tiles in the video frame is an I tile, and the remainder of the tiles are P tiles. For example, inFIG. 7a, tile0of uncompressed frame701is encoded as I tile I0and the remaining tiles1through15are encoded as P tiles P1through P15to produce R frame711. In the next uncompressed video frame702, tile1is encoded as I tile I1and the remaining tiles0and2through15are encoded as P tiles, P0and P2through P15, to produce R frame712. Thus, the I tiles and P tiles are progressively interleaved in time over successive frames. The process continues until an R frame770is generated with the last tile in the matrix encoded as an I tile (i.e., I15). The process then starts over, generating another R frame such as frame711(i.e., encoding an I tile for tile0) etc. Although not illustrated inFIG. 7a, in one embodiment, the first R frame of the video sequence of R frames contains only I tiles (i.e., so that subsequent P frames have reference image data from which to calculate motion). Alternatively, in one embodiment, the startup sequence uses the same I tile pattern as normal, but does not include P tiles for those tiles that have not yet been encoded with an I tile. In other words, certain tiles are not encoded with any data until the first I tile arrives, thereby avoiding startup peaks in the video stream data rate934inFIG. 9a, which is explained in further detail below. Moreover, as described below, various different sizes and shapes may be used for the tiles while still complying with these underlying principles.
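The rotating I tile pattern can be made concrete with a short sketch. This is only an illustrative model of theFIG. 7acycle (the r_frame_pattern helper and its 16-tile default are our own names, not part of the disclosure):

```python
# Illustrative sketch of the cyclic I/P tile pattern of FIG. 7a: a 4x4 grid
# of 16 tiles, with the I tile position rotating one tile per frame so each
# R frame carries exactly one I tile and fifteen P tiles.

NUM_TILES = 16

def r_frame_pattern(frame_index: int, num_tiles: int = NUM_TILES) -> list[str]:
    """Tile types for one R frame; tile k is the I tile when k == frame mod n."""
    i_tile = frame_index % num_tiles
    return ["I" if t == i_tile else "P" for t in range(num_tiles)]

for f in range(3):
    print(f, "".join(r_frame_pattern(f)))
# 0 IPPPPPPPPPPPPPPP
# 1 PIPPPPPPPPPPPPPP
# 2 PPIPPPPPPPPPPPPP
```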
The video decompression logic412running on the client415decompresses each tile as if it is a separate video sequence of small I and P frames, and then renders each tile to the frame buffer driving display device422. For example, I0and P0from R frames711to770are used to decompress and render tile0of the video image. Similarly, I1and P1from R frames711to770are used to reconstruct tile1, and so on. As mentioned above, decompression of I frames and P frames is well known in the art, and decompression of I tiles and P tiles can be accomplished by having multiple instances of a video decompressor running in the client415. Although multiplying processes would seem to increase the computational burden on client415, it actually does not, because the tiles themselves are proportionally smaller relative to the number of additional processes, so the number of pixels displayed is the same as if there were one process using conventional full-sized I and P frames.
This R frame technique significantly mitigates the bandwidth peaks typically associated with I frames illustrated inFIGS. 6band 6cbecause any given frame is mostly made up of P tiles, which are typically smaller than I tiles. For example, assuming again that a typical I frame is 160 Kb, then the I tiles of each of the frames illustrated inFIG. 7awould be roughly 1/16 of this amount, or 10 Kb. Similarly, assuming that a typical P frame is 16 Kb, then the P tiles for each of the tiles illustrated inFIG. 7amay be roughly 1 Kb. The end result is an R frame of approximately 10 Kb+15*1 Kb=25 Kb. So, at 60 frames/second, a 60-frame sequence would be 25 Kb*60=1.5 Mb, requiring a channel capable of sustaining a bandwidth of 1.5 Mbps, but with much lower peaks due to the I tiles being distributed throughout the 60-frame interval.
Note that in previous examples with the same assumed data rates for I frames and P frames, the average data rate was 1.1 Mbps. This is because in the previous examples, a new I frame was only introduced once every 60 frame times, whereas in this example, the 16 tiles that make up an I frame cycle through in 16 frame times, and as such the equivalent of an I frame is introduced every 16 frame times, resulting in a slightly higher average data rate. In practice, though, introducing more frequent I frames does not increase the data rate linearly. This is due to the fact that a P frame (or a P tile) primarily encodes the difference from the prior frame to the next. So, if the prior frame is quite similar to the next frame, the P frame will be very small; if the prior frame is quite different from the next frame, the P frame will be very large. But because a P frame is largely derived from the previous frame, rather than from the actual frame, the resulting encoded frame may contain more errors (e.g., visual artifacts) than an I frame with an adequate number of bits. And, when one P frame follows another P frame, what can occur is an accumulation of errors that gets worse when there is a long sequence of P frames. Now, a sophisticated video compressor will detect the fact that the quality of the image is degrading after a sequence of P frames and, if necessary, it will allocate more bits to subsequent P frames to bring up the quality or, if it is the most efficient course of action, replace a P frame with an I frame. So, when long sequences of P frames are used (e.g., 59 P frames, as in prior examples above), particularly when the scene has a great deal of complexity and/or motion, typically, more bits are needed for P frames as they get further removed from an I frame.
Or, to look at P frames from the opposite point of view, P frames that closely follow an I frame tend to require fewer bits than P frames that are further removed from an I frame. So, in the example shown inFIG. 7a, no P frame is further than 15 frames removed from an I frame that precedes it, whereas in the prior example, a P frame could be 59 frames removed from an I frame. Thus, with more frequent I frames, the P frames are smaller. Of course, the exact relative sizes will vary based on the nature of the video stream, but in the example ofFIG. 7a, if an I tile is 10 Kb, P tiles, on average, may be only 0.75 Kb in size, resulting in 10 Kb+15*0.75 Kb=21.25 Kb, or at 60 frames per second, a data rate of 21.25 Kb*60=approximately 1.3 Mbps, about 16% higher than a stream with an I frame followed by 59 P frames at 1.1 Mbps. Once again, the relative results between these two approaches to video compression will vary depending upon the video sequence, but typically, we have found empirically that using R frames requires about 20% more bits for a given level of quality than using I/P frame sequences. But, of course, R frames dramatically reduce the peaks, which makes the video sequences usable with far less latency than I/P frame sequences.
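The data rate arithmetic in the last two paragraphs can be checked directly. The script below simply reproduces the figures from the text (160 Kb I frames, 16 Kb P frames, 10 Kb I tiles, 1 Kb and 0.75 Kb average P tiles); the helper names are ours.

```python
# Worked check of the data rate figures in the text (sizes in kilobits).
FPS = 60

def mbps(bits_per_frame_kb: float) -> float:
    return bits_per_frame_kb * FPS / 1000.0

# Conventional stream: one 160 Kb I frame + 59 16 Kb P frames per second.
ip_avg = (160 + 59 * 16) / 60            # ~18.4 Kb/frame
# R frames, FIG. 7a: one 10 Kb I tile + 15 P tiles of ~1 Kb or ~0.75 Kb each.
r_frame_1kb = 10 + 15 * 1.0              # 25 Kb    -> 1.5 Mbps
r_frame_075 = 10 + 15 * 0.75             # 21.25 Kb -> ~1.3 Mbps

print(f"I/P stream:                {mbps(ip_avg):.2f} Mbps")       # ~1.10
print(f"R frames @1 Kb P tiles:    {mbps(r_frame_1kb):.2f} Mbps")  # 1.50
print(f"R frames @0.75 Kb P tiles: {mbps(r_frame_075):.2f} Mbps")  # ~1.28
```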
R frames can be configured in a variety of different ways, depending upon the nature of the video sequence, the reliability of the channel, and the available data rate. In an alternative embodiment, a different number of tiles is used than 16 in a 4×4 configuration. For example, 2 tiles may be used in a 2×1 or 1×2 configuration, 4 tiles may be used in a 2×2, 4×1 or 1×4 configuration, 6 tiles may be used in a 3×2, 2×3, 6×1 or 1×6 configuration, or 8 tiles may be used in a 4×2 (as shown inFIG. 7b), 2×4, 8×1 or 1×8 configuration. Note that the tiles need not be square, nor must the video frame be square, or even rectangular. The tiles can be broken up into whatever shape best suits the video stream and the application used.
In another embodiment, the cycling of the I and P tiles is not locked to the number of tiles. For example, in an 8-tile 4×2 configuration, a 16-cycle sequence can still be used, as illustrated inFIG. 7b. Sequential uncompressed frames721,722,723are each divided into 8 tiles0-7, and each tile is compressed individually. For R frame731, only tile0is compressed as an I tile, and the remaining tiles are compressed as P tiles. For the subsequent R frame732, all of the 8 tiles are compressed as P tiles, and then for the subsequent R frame733, tile1is compressed as an I tile and the other tiles are all compressed as P tiles. And so the sequencing continues for 16 frames, with an I tile generated only every other frame, so the last I tile is generated for tile7during the 15thframe time (not shown inFIG. 7b) and during the 16thframe time R frame780is compressed using all P tiles. Then, the sequence begins again with tile0compressed as an I tile and the other tiles compressed as P tiles. As in the prior embodiment, the very first frame of the entire video sequence would typically be all I tiles, to provide a reference for P tiles from that point forward. The cycling of I tiles and P tiles need not even be an even multiple of the number of tiles. For example, with 8 tiles, each frame with an I tile can be followed by 2 frames with all P tiles before another I tile is used. In yet another embodiment, certain tiles may be sequenced with I tiles more often than other tiles if, for example, certain areas of the screen are known to have more motion, requiring more frequent I tiles, while others are more static (e.g., showing a score for a game), requiring less frequent I tiles. Moreover, although each frame is illustrated inFIGS. 7a-bwith a single I tile, multiple I tiles may be encoded in a single frame (depending on the bandwidth of the transmission channel). Conversely, certain frames or frame sequences may be transmitted with no I tiles (i.e., only P tiles).
The reason the approaches of the preceding paragraph work well is that, while not having I tiles distributed across every single frame would seem to result in larger peaks, the behavior of the system is not that simple. Since each tile is compressed separately from the other tiles, as the tiles get smaller the encoding of each tile can become less efficient, because the compressor of a given tile is not able to exploit similar image features and similar motion from the other tiles. Thus, dividing up the screen into 16 tiles generally will result in a less efficient encoding than dividing up the screen into 8 tiles. But, if the screen is divided into 8 tiles and it causes the data of a full I frame to be introduced every 8 frames instead of every 16 frames, it results in a much higher data rate overall. So, by introducing a full I frame every 16 frames instead of every 8 frames, the overall data rate is reduced. Also, by using 8 larger tiles instead of 16 smaller tiles, the overall data rate is reduced, which also mitigates to some degree the data peaks caused by the larger tiles.
In another embodiment, the low-latency video compression logic404inFIGS. 7aand 7bcontrols the allocation of bits to the various tiles in the R frames either by being pre-configured by settings, based on known characteristics of the video sequence to be compressed, or automatically, based upon an ongoing analysis of the image quality in each tile. For example, in some racing video games, the front of the player's car (which is relatively motionless in the scene) takes up a large part of the lower half of the screen, whereas the upper half of the screen is entirely filled with the oncoming roadway, buildings and scenery, which is almost always in motion. If the compression logic404allocates an equal number of bits to each tile, then the tiles on the bottom half of the screen (tiles4-7in uncompressed frame721inFIG. 7b) will generally be compressed with higher quality than the tiles in the upper half of the screen (tiles0-3in uncompressed frame721inFIG. 7b). If this particular game, or this particular scene of the game, is known to have such characteristics, then the operators of the hosting service210can configure the compression logic404to allocate more bits to the tiles at the top of the screen than to the tiles at the bottom of the screen. Or, the compression logic404can evaluate the quality of the compression of the tiles after frames are compressed (using one or more of many compression quality metrics, such as Peak Signal-To-Noise Ratio (PSNR)), and if it determines that over a certain window of time certain tiles are consistently producing better quality results, then it gradually allocates more bits to the tiles that are producing lower quality results, until the various tiles reach a similar level of quality. In an alternative embodiment, the compressor logic404allocates bits to achieve higher quality in a particular tile or group of tiles. For example, it may provide a better overall perceptual appearance to have higher quality in the center of the screen than at the edges.
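As a hedged sketch of the automatic allocation idea, the fragment below shifts bits from tiles measuring consistently high PSNR toward tiles measuring low PSNR over a window of frames. The step size, the 1 dB threshold, and the example figures are illustrative assumptions, not values from the disclosure.

```python
# Illustrative bit reallocation: drain bits from the best-scoring tile toward
# the worst-scoring tile until per-tile quality evens out over time.

def rebalance_bits(budgets: list[int], psnr: list[float],
                   step: int = 200) -> list[int]:
    """Move `step` bits per call from the highest-PSNR tile to the lowest."""
    best = max(range(len(psnr)), key=lambda t: psnr[t])
    worst = min(range(len(psnr)), key=lambda t: psnr[t])
    if psnr[best] - psnr[worst] > 1.0 and budgets[best] > step:
        budgets = budgets[:]
        budgets[best] -= step
        budgets[worst] += step
    return budgets

budgets = [4000] * 8                  # equal bits per tile initially
# Measured over a window: tiles 0-3 compress easily (high PSNR), 4-7 poorly.
window_psnr = [42, 41, 40, 41, 33, 34, 35, 33]
print(rebalance_bits(budgets, window_psnr))
# -> [3800, 4000, 4000, 4000, 4200, 4000, 4000, 4000]
```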
In one embodiment, to improve resolution of certain regions of the video stream, the video compression logic404uses smaller tiles to encode areas of the video stream with relatively more scene complexity and/or motion than areas of the video stream with relatively less scene complexity and/or motion. For example, as illustrated inFIG. 8, smaller tiles are employed around a moving character805in one area of one R frame811(potentially followed by a series of R frames with the same tile sizes (not shown)). Then, when the character805moves to a new area of the image, smaller tiles are used around this new area within another R frame812, as illustrated. As mentioned above, various different sizes and shapes may be employed as “tiles” while still complying with these underlying principles.
While the cyclic I/P tiles described above substantially reduce the peaks in the data rate of a video stream, they do not eliminate the peaks entirely, particularly in the case of rapidly-changing or highly complex video imagery, such as occurs with motion pictures, video games, and some application software. For example, during a sudden scene transition, a complex frame may be followed by another complex frame that is completely different. Even though several I tiles may have preceded the scene transition by only a few frame times, they don't help in this situation because the new frame's material has no relation to the previous I tiles. In such a situation (and in other situations where even though not everything changes, much of the image changes), the video compressor404will determine that many, if not all, of the P tiles are more efficiently coded as I tiles, and what results is a very large peak in the data rate for that frame.
As discussed previously, with most consumer-grade Internet connections (and many office connections), it simply is not feasible to "jam" data that exceeds the available maximum data rate, shown as622inFIG. 6calong with the rated maximum data rate621. Note that the rated maximum data rate621(e.g., "6 Mbps DSL") is essentially a marketing number for users considering the purchase of an Internet connection, but generally it does not guarantee a level of performance. For the purposes of this application, it is irrelevant, since our only concern is the available maximum data rate622at the time the video is streamed through the connection. Consequently, inFIGS. 9aand 9c, as we describe a solution to the peaking problem, the rated maximum data rate is omitted from the graph, and only the available maximum data rate922is shown. The video stream data rate must not exceed the available maximum data rate922.
To address this, the first thing that the video compressor404does is determine a peak data rate941, which is a data rate the channel is able to handle steadily. This rate can be determined by a number of techniques. One such technique is to gradually send an increasingly higher data rate test stream from the hosting service210to the client415inFIGS. 4aand 4b, and have the client provide feedback to the hosting service as to the level of packet loss and latency. When the packet loss and/or latency begins to show a sharp increase, that is an indication that the available maximum data rate922is being reached. After that, the hosting service210can gradually reduce the data rate of the test stream until the client415reports that, for a reasonable period of time, the test stream has been received with an acceptable level of packet loss and near-minimal latency. This establishes the peak data rate941, which will then be used as the peak data rate for streaming video. Over time, the peak data rate941will fluctuate (e.g., if another user in a household starts to heavily use the Internet connection), and the client415will need to constantly monitor the channel to see whether packet loss or latency increases, indicating that the available max data rate922is dropping below the previously established peak data rate941, and if so, to reduce the peak data rate941. Similarly, if over time the client415finds that the packet loss and latency remain at optimal levels, it can request that the video compressor slowly increase the data rate to see whether the available maximum data rate has increased (e.g., if another user in a household has stopped heavy use of the Internet connection), again waiting until packet loss and/or higher latency indicates that the available maximum data rate922has been exceeded, at which point a lower level can again be found for the peak data rate941, but one that is perhaps higher than the level before the increased data rate was tested. So, by using this technique (and other techniques like it) a peak data rate941can be found, and adjusted periodically as needed. The peak data rate941establishes the maximum data rate that can be used by the video compressor404to stream video to the user. The logic for determining the peak data rate may be implemented at the user premises211and/or on the hosting service210: at the user premises211, the client device415performs the calculations to determine the peak data rate and transmits this information back to the hosting service210; at the hosting service210, a server402at the hosting service performs the calculations to determine the peak data rate based on statistics received from the client415(e.g., packet loss, latency, max data rate, etc.).
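A simplified version of this probing loop might look as follows. The feedback interface (send_test_stream), the step sizes, and the loss/latency thresholds are assumptions for illustration; as noted above, the real logic may run at the user premises211, on the hosting service210, or both.

```python
# Sketch of the peak data rate probe: ramp a test stream up until packet loss
# or latency spikes, then back off until the stream is clean for a while.

def find_peak_rate(send_test_stream, start_bps=1_000_000, step_bps=500_000,
                   loss_limit=0.01, latency_limit_ms=30.0):
    """send_test_stream(bps) -> (packet_loss_fraction, latency_ms)."""
    rate = start_bps
    while True:
        loss, latency = send_test_stream(rate + step_bps)
        if loss > loss_limit or latency > latency_limit_ms:
            break                      # channel saturating: stop ramping
        rate += step_bps
    # Back off until the stream is received cleanly again.
    loss, latency = send_test_stream(rate)
    while loss > loss_limit or latency > latency_limit_ms:
        rate -= step_bps
        loss, latency = send_test_stream(rate)
    return rate                        # becomes the peak data rate 941

# Fake channel that saturates near 5 Mbps, for demonstration only:
def fake_channel(bps):
    return (0.0, 12.0) if bps <= 5_000_000 else (0.08, 90.0)

print(find_peak_rate(fake_channel))    # -> 5000000
```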
FIG. 9ashows an example video stream data rate934that has substantial scene complexity and/or motion that has been generated using the cyclic I/P tile compression techniques described previously and illustrated inFIGS. 7a, 7band8. The video compressor404has been configured to output compressed video at an average data rate that is below the peak data rate941, and note that, most of the time, the video stream data rate remains below the peak data rate941. A comparison of data rate934with video stream data rate634shown inFIG. 6ccreated using I/P/B or I/P frames shows that the cyclic I/P tile compression produces a much smoother data rate. Still, at frame 2× peak952(which approaches 2× the peak data rate942) and frame 4× peak954(which approaches 4× the peak data rate944), the data rate exceeds the peak data rate941, which is unacceptable. In practice, even with high action video from rapidly changing video games, peaks in excess of peak data rate941occur in less than 2% of frames, peaks in excess of 2× peak data rate942occur rarely, and peaks in excess of 3× peak data rate943occur hardly ever. But, when they do occur (e.g., during a scene transition), the data rate required by them is necessary to produce a good quality video image.
One way to solve this problem is simply to configure the video compressor404such that its maximum data rate output is the peak data rate941. Unfortunately, the resulting video output quality during the peak frames is poor since the compression algorithm is "starved" for bits. What results is the appearance of compression artifacts when there are sudden transitions or fast motion, and in time, the user comes to realize that the artifacts always crop up when there are sudden changes or rapid motion, and they can become quite annoying.
Although the human visual system is quite sensitive to visual artifacts that appear during sudden changes or rapid motion, it is not very sensitive to detecting a reduction in frame rate in such situations. In fact, when such sudden changes occur, it appears that the human visual system is preoccupied with tracking the changes, and it doesn't notice if the frame rate briefly drops from 60 fps to 30 fps, and then returns immediately to 60 fps. And, in the case of a very dramatic transition, like a sudden scene change, the human visual system doesn't notice if the frame rate drops to 20 fps or even 15 fps, and then immediately returns to 60 fps. So long as the frame rate reduction only occurs infrequently, to a human observer, it appears that the video has been continuously running at 60 fps.
This property of the human visual system is exploited by the techniques illustrated inFIG. 9b. A server402(fromFIGS. 4aand 4b) produces an uncompressed video output stream at a steady frame rate (at 60 fps in one embodiment). A timeline shows each frame961-970output each 1/60thsecond. Each uncompressed video frame, starting with frame961, is output to the low-latency video compressor404, which compresses the frame in less than a frame time, producing for the first frame compressed frame1981. The data produced for the compressed frame1981may be larger or smaller, depending upon many factors, as previously described. If the data is small enough that it can be transmitted to the client415in a frame time ( 1/60thsecond) or less at the peak data rate941, then it is transmitted during transmit time (xmit time)991(the length of the arrow indicates the duration of the transmit time). In the next frame time, server402produces uncompressed frame2962, it is compressed to compressed frame2982, and it is transmitted to client415during transmit time992, which is less than a frame time at peak data rate941.
Then, in the next frame time, server402produces uncompressed frame3963. When it is compressed by video compressor404, the resulting compressed frame3983is more data than can be transmitted at the peak data rate941in one frame time. So, it is transmitted during transmit time (2× peak)993, which takes up all of the frame time and part of the next frame time. Now, during the next frame time, server402produces another uncompressed frame4964and outputs it to video compressor404, but the data is ignored, as illustrated by974. This is because video compressor404is configured to ignore further uncompressed video frames that arrive while it is still transmitting a prior compressed frame. Of course, client415's video decompressor will fail to receive frame4, but it simply continues to display frame3on display device422for 2 frame times (i.e., briefly reduces the frame rate from 60 fps to 30 fps).
For the next frame5, server402outputs uncompressed frame5965, which is compressed to compressed frame5985and transmitted within 1 frame time during transmit time995. Client415's video decompressor decompresses frame5and displays it on display device422. Next, server402outputs uncompressed frame6966, and video compressor404compresses it to compressed frame6986, but this time the resulting data is very large. The compressed frame is transmitted during transmit time (4× peak)996at the peak data rate941, but it takes almost 4 frame times to transmit the frame. During the next 3 frame times, video compressor404ignores 3 frames from server402, and client415's decompressor holds frame6steadily on the display device422for 4 frame times (i.e., briefly reduces the frame rate from 60 fps to 15 fps). Then finally, server402outputs frame10970, video compressor404compresses it into compressed frame10987, it is transmitted during transmit time997, client415's decompressor decompresses frame10and displays it on display device422, and once again the video resumes at 60 fps.
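The FIG. 9b behavior can be modeled with a short simulation: frames arrive every 1/60thsecond, and any frame that arrives while a prior compressed frame is still being transmitted at the peak data rate is dropped. The frame sizes and the 5 Mbps peak rate below are illustrative; the indices are 0-based, so frames 2 and 5 here correspond to the 1-based frames 3 and 6 of the walkthrough above.

```python
# Simulation of the frame-dropping rule: ignore new frames while the
# transmitter is still busy sending an oversized prior compressed frame.

FRAME_TIME = 1 / 60
PEAK_BPS = 5_000_000
BITS_PER_FRAME_TIME = PEAK_BPS * FRAME_TIME

def simulate(compressed_sizes_bits):
    busy_until = 0.0                       # transmitter is free at this time
    for i, size in enumerate(compressed_sizes_bits):
        t = i * FRAME_TIME                 # frame i arrives from the server
        if t < busy_until:
            print(f"frame {i}: dropped (still transmitting prior frame)")
            continue
        xmit = size / PEAK_BPS
        busy_until = t + xmit
        print(f"frame {i}: sent in {xmit / FRAME_TIME:.1f} frame times")

# frame 2 is a 2x peak, frame 5 roughly a 4x peak:
sizes = [1.0, 1.0, 2.0, 1.0, 1.0, 3.9, 1.0, 1.0, 1.0, 1.0]
simulate([s * BITS_PER_FRAME_TIME for s in sizes])
# frames 3, 6, 7, and 8 are dropped, matching the FIG. 9b narrative.
```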
Note that although video compressor404drops video frames from the video stream generated by server402, it does not drop audio data, regardless of what form the audio comes in; it continues to compress the audio data when video frames are dropped and transmits it to client415, which continues to decompress the audio data and provide the audio to whatever device is used by the user to play back the audio. Thus audio continues unabated during periods when frames are dropped. Compressed audio consumes a relatively small percentage of bandwidth, compared to compressed video, and as a result does not have a major impact on the overall data rate. Although it is not illustrated in any of the data rate diagrams, there is always data rate capacity reserved for the compressed audio stream within the peak data rate941.
The example just described inFIG. 9bwas chosen to illustrate how the frame rate drops during data rate peaks, but what it does not illustrate is that when the cyclic I/P tile techniques described previously are used, such data rate peaks, and the consequential dropped frames are rare, even during high scene complexity/high action sequences such as those that occur in video games, motion pictures and some application software. Consequently, the reduced frame rates are infrequent and brief, and the human visual system does not detect them.
If the frame rate reduction mechanism just described is applied to the video stream data rate illustrated inFIG. 9a, the resulting video stream data rate is illustrated inFIG. 9c. In this example, 2× peak952has been reduced to flattened 2× peak953, and 4× peak954has been reduced to flattened 4× peak955, and the entire video stream data rate934remains at or below the peak data rate941.
Thus, using the techniques described above, a high action video stream can be transmitted with low latency through the general Internet and through a consumer-grade Internet connection. Further, in an office environment on a LAN (e.g., 100 Mbps Ethernet or 802.11g wireless) or on a private network (e.g., a 100 Mbps connection between a data center and an office) a high action video stream can be transmitted without peaks, so that multiple users (e.g., each transmitting 1920×1080 at 60 fps at 4.5 Mbps) can use the LAN or shared private data connection without overlapping peaks overwhelming the network or the network switch backplanes.
Data Rate Adjustment
In one embodiment, the hosting service210initially assesses the available maximum data rate622and latency of the channel to determine an appropriate data rate for the video stream and then dynamically adjusts the data rate in response to changing channel conditions. To adjust the data rate, the hosting service210may, for example, modify the image resolution and/or the number of frames/second of the video stream to be sent to the client415. Also, the hosting service can adjust the quality level of the compressed video. When changing the resolution of the video stream, e.g., from a 1280×720 resolution to a 640×360 resolution, the video decompression logic412on the client415can scale up the image to maintain the same image size on the display screen.
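One plausible way to realize this adjustment is a ladder of (resolution, frame rate) operating points, stepping down when the measured channel rate falls. The specific rungs and bit rates below are assumptions for illustration, not values from the disclosure.

```python
# Illustrative ladder of stream operating points; pick the highest rung that
# fits the currently available data rate. The client scales the image up to
# the full display size when a lower resolution is selected.

LADDER = [
    ((1280, 720), 60, 5_000_000),
    ((1280, 720), 30, 3_000_000),
    (( 640, 360), 60, 2_000_000),
    (( 640, 360), 30, 1_200_000),
]

def choose_stream(available_bps: int):
    for resolution, fps, need in LADDER:
        if need <= available_bps:
            return resolution, fps
    return LADDER[-1][0], LADDER[-1][1]    # floor: best-effort lowest rung

print(choose_stream(4_000_000))   # -> ((1280, 720), 30)
print(choose_stream(1_500_000))   # -> ((640, 360), 30)
```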
In one embodiment, in a situation where the channel completely drops out, the hosting service210pauses the game. In the case of a multiplayer game, the hosting service reports to the other users that the user has dropped out of the game and/or pauses the game for the other users.
Dropped or Delayed Packets
In one embodiment, if data is lost due to packet loss between the video compressor404and client415inFIG. 4aor4b, or due to a packet that is received out of order and arrives too late to be decompressed while meeting the latency requirements of the decompressed frame, the video decompression logic412is able to mitigate the visual artifacts. In a streaming I/P frame implementation, if there is a lost/delayed packet, the entire screen is impacted, potentially causing the screen to completely freeze for a period of time or show other screen-wide visual artifacts. For example, if a lost/delayed packet causes the loss of an I frame, then the decompressor will lack a reference for all of the P frames that follow until a new I frame is received. If a P frame is lost, then it will impact the P frames for the entire screen that follow. Depending on how long it will be before an I frame appears, this will have a longer or shorter visual impact. Using interleaved I/P tiles as shown inFIGS. 7aand 7b, a lost/delayed packet is much less likely to impact the entire screen since it will only affect the tiles contained in the affected packet. If each tile's data is sent within an individual packet, then if a packet is lost, it will only affect one tile. Of course, the duration of the visual artifact will depend on whether an I tile packet is lost and, if a P tile is lost, how many frames it will take until an I tile appears. But, given that different tiles on the screen are being updated with I tiles very frequently (potentially every frame), even if one tile on the screen is affected, other tiles may not be. Further, if some event causes a loss of several packets at once (e.g., a spike in power next to a DSL line that briefly disrupts the data flow), then some of the tiles will be affected more than others, but because some tiles will quickly be renewed with a new I tile, they will be only briefly affected. Also, with a streaming I/P frame implementation, not only are the I frames the most critical frames, but the I frames are extremely large, so if there is an event that causes a dropped/delayed packet, there is a higher probability that an I frame will be affected (i.e., if any part of an I frame is lost, it is unlikely that the I frame can be decompressed at all) than a much smaller I tile. For all of these reasons, using I/P tiles results in far fewer visual artifacts when packets are dropped/delayed than with I/P frames.
One embodiment attempts to reduce the effect of lost packets by intelligently packaging the compressed tiles within the TCP (transmission control protocol) packets or UDP (user datagram protocol) packets. For example, in one embodiment, tiles are aligned with packet boundaries whenever possible.FIG. 10aillustrates how tiles might be packed within a series of packets1001-1005without implementing this feature. Specifically, inFIG. 10a, tiles cross packet boundaries and are packed inefficiently so that the loss of a single packet results in the loss of multiple tiles. For example, if packets1003or1004are lost, three tiles are lost, resulting in visual artifacts.
By contrast,FIG. 10billustrates tile packing logic1010for intelligently packing tiles within packets to reduce the effect of packet loss. First, the tile packing logic1010aligns tiles with packet boundaries. Thus, tiles T1, T3, T4, T7, and T2are aligned with the boundaries of packets1001-1005, respectively. The tile packing logic also attempts to fit tiles within packets in the most efficient manner possible, without crossing packet boundaries. Based on the size of each of the tiles, tiles T1and T6are combined in one packet1001; T3and T5are combined in one packet1002; tiles T4and T8are combined in one packet1003; tile T7is added to packet1004; and tile T2is added to packet1005. Thus, under this scheme, a single packet loss will result in the loss of no more than 2 tiles (rather than 3 tiles as illustrated inFIG. 10a).
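A minimal sketch of such tile packing logic follows: every tile starts at a packet boundary, and whole tiles are packed first-fit into packets, largest first. The 1400-byte payload size and the example tile sizes are assumptions; with these particular inputs the packing happens to reproduce the pairings described above forFIG. 10b(T1with T6, T3with T5, T4with T8, and T7and T2alone).

```python
# First-fit tile packing: tiles never cross packet boundaries, so a single
# lost packet costs only the whole tiles it carried.

MTU = 1400  # payload bytes per packet (illustrative)

def pack_tiles(tile_sizes: dict[str, int], mtu: int = MTU) -> list[list[str]]:
    packets = []                            # each entry: [bytes_used, [ids]]
    # Placing larger tiles first makes first-fit packing tighter.
    for tile in sorted(tile_sizes, key=tile_sizes.get, reverse=True):
        size = tile_sizes[tile]
        for slot in packets:
            if slot[0] + size <= mtu:       # fits in an existing packet
                slot[0] += size
                slot[1].append(tile)
                break
        else:
            packets.append([size, [tile]])  # open a new packet
    return [ids for _, ids in packets]

sizes = {"T1": 900, "T2": 1300, "T3": 800, "T4": 1000,
         "T5": 600, "T6": 500, "T7": 1200, "T8": 400}
print(pack_tiles(sizes))
# -> [['T2'], ['T7'], ['T4', 'T8'], ['T1', 'T6'], ['T3', 'T5']]
# Each sublist is the set of tiles lost if that one packet is lost.
```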
One additional benefit to the embodiment shown inFIG. 10bis that the tiles are transmitted in a different order from that in which they are displayed within the image. This way, if adjacent packets are lost from the same event interfering with the transmission, it will affect areas which are not near each other on the screen, creating less noticeable artifacting on the display.
One embodiment employs forward error correction (FEC) techniques to protect certain portions of the video stream from channel errors. As is known in the art, FEC techniques, such as Reed-Solomon and Viterbi, generate and append error correction information to data transmitted over a communications channel. If an error occurs in the underlying data (e.g., an I frame), then the FEC may be used to correct the error.
FEC codes increase the data rate of the transmission, so ideally, they are only used where they are most needed. If data is being sent that would not result in a very noticeable visual artifact, it may be preferable not to use FEC codes to protect the data. For example, if a P tile that immediately precedes an I tile is lost, it will only create a visual artifact (i.e., one tile on the screen will not be updated) for 1/60thof a second on the screen. Such a visual artifact is barely detectable by the human eye. As P tiles are further back from an I tile, losing a P tile becomes increasingly more noticeable. For example, if a tile cycle pattern is an I tile followed by 15 P tiles before an I tile is available again, then if the P tile immediately following an I tile is lost, it will result in that tile showing an incorrect image for 15 frame times (at 60 fps, that would be 250 ms). The human eye will readily detect a disruption in a stream for 250 ms. So, the further back a P tile is from a new I tile (i.e., the closer a P tile follows an I tile), the more noticeable the artifact. As previously discussed, though, in general, the closer a P tile follows an I tile, the smaller the data for that P tile. Thus, P tiles following I tiles not only are more critical to protect from being lost, but they are also smaller in size. And, in general, the smaller the data that needs to be protected, the smaller the FEC code needs to be to protect it.
So, as illustrated inFIG. 11a, in one embodiment, because of the importance of I tiles in the video stream, only I tiles are provided with FEC codes. Thus, FEC1101contains error correction code for I tile1100and FEC1104contains error correction code for I tile1103. In this embodiment, no FEC is generated for the P tiles.
In one embodiment, illustrated inFIG. 11b, FEC codes are also generated for the P tiles which are most likely to cause visual artifacts if lost. In this embodiment, FECs1105provide error correction codes for the first 3 P tiles, but not for the P tiles that follow. In another embodiment, FEC codes are generated for the P tiles which are smallest in data size (which will tend to self-select P tiles occurring the soonest after an I tile, which are the most critical to protect).
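The selective protection policy ofFIGS. 11aand 11bcan be expressed as a small predicate: always protect I tiles, and protect only the first few P tiles after each I tile (the ones whose loss would be visible longest). The "first 3" default mirrors theFIG. 11bexample; the function name is ours.

```python
# Selective-FEC policy sketch: I tiles always protected (FIG. 11a); only the
# earliest P tiles after each I tile protected (FIG. 11b).

def needs_fec(tile_type: str, frames_since_i_tile: int,
              protected_p_tiles: int = 3) -> bool:
    if tile_type == "I":
        return True                                   # FIG. 11a
    return frames_since_i_tile <= protected_p_tiles   # FIG. 11b

cycle = ["I"] + ["P"] * 15            # one I tile, then 15 P tiles
for offset, t in enumerate(cycle):
    print(offset, t, "FEC" if needs_fec(t, offset) else "-")
# Tiles 0 (I) and 1-3 (early P) get FEC; later P tiles go unprotected.
```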
In another embodiment, rather than sending an FEC code with a tile, the tile is transmitted twice, each time in a different packet. If one packet is lost/delayed, the other packet is used.
In one embodiment, shown inFIG. 11c, FEC codes1111and1113are generated for audio packets1110and1112, respectively, transmitted from the hosting service concurrently with the video. It is particularly important to maintain the integrity of the audio in a video stream because distorted audio (e.g., clicking or hissing) will result in a particularly undesirable user experience. The FEC codes help to ensure that the audio content is rendered at the client computer415without distortion.
In another embodiment, rather than sending an FEC code with audio data, the audio data is transmitted twice, each time in a different packet. If one packet is lost/delayed, the other packet is used.
In addition, in one embodiment illustrated inFIG. 11d, FEC codes1121and1123are used for user input commands1120and1122, respectively (e.g., button presses) transmitted upstream from the client415to the hosting service210. This is important because missing a button press or a mouse movement in a video game or an application could result in an undesirable user experience.
In another embodiment, rather than sending an FEC code with user input command data, the user input command data is transmitted twice, each time in a different packet. If one packet is lost/delayed, the other packet is used.
In one embodiment, the hosting service210assesses the quality of the communication channel with the client415to determine whether to use FEC and, if so, to what portions of the video, audio and user commands FEC should be applied. Assessing the "quality" of the channel may include functions such as evaluating packet loss, latency, etc., as described above. If the channel is particularly unreliable, then the hosting service210may apply FEC to all of the I tiles, P tiles, audio and user commands. By contrast, if the channel is reliable, then the hosting service210may apply FEC only to audio and user commands, or may not apply FEC to audio or video, or may not use FEC at all. Various other permutations of the application of FEC may be employed while still complying with these underlying principles. In one embodiment, the hosting service210continually monitors the conditions of the channel and changes the FEC policy accordingly.
In another embodiment, referring toFIGS. 4aand 4b, when a packet is lost/delayed, resulting in the loss of tile data, or if, perhaps because of a particularly bad packet loss, the FEC is unable to correct lost tile data, the client415assesses how many frames are left before a new I tile will be received and compares that interval to the round-trip latency from the client415to hosting service210. If the round-trip latency is less than the time before a new I tile is due to arrive, then the client415sends a message to the hosting service210requesting a new I tile. This message is routed to the video compressor404, and rather than generating a P tile for the tile whose data had been lost, it generates an I tile. Given that the system shown inFIGS. 4aand 4bis designed to provide a round-trip latency that is typically less than 80 ms, this results in a tile being corrected within 80 ms (at 60 fps, frames are 16.67 ms in duration, thus in full frame times, 80 ms latency would result in a corrected tile within 83.33 ms, which is 5 frame times—a noticeable disruption, but far less noticeable than, for example, a 250 ms disruption for 15 frames). When the compressor404generates such an I tile out of its usual cyclic order, if the I tile would cause the bandwidth of that frame to exceed the available bandwidth, then the compressor404will delay the cycles of the other tiles so that the other tiles receive P tiles during that frame time (even if one tile would normally be due an I tile during that frame), and then starting with the next frame the usual cycling will continue, and the tile that normally would have received an I tile in the preceding frame will receive an I tile. Although this action briefly delays the phase of the R frame cycling, it normally will not be noticeable visually.
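The client-side decision reduces to comparing the round-trip time against the wait for the next scheduled I tile. A sketch, assuming 60 fps frame times (the function name and inputs are ours):

```python
# Request an out-of-cycle I tile only if it can arrive sooner than the next
# scheduled I tile for that tile position.

FRAME_MS = 1000 / 60                     # 16.67 ms per frame at 60 fps

def should_request_i_tile(frames_until_next_i: int,
                          round_trip_ms: float) -> bool:
    """True if a requested I tile beats waiting for the regular cycle."""
    return round_trip_ms < frames_until_next_i * FRAME_MS

# 80 ms round trip vs. an I tile due in 15 frames (250 ms): request it.
print(should_request_i_tile(15, 80.0))   # True  -> fixed in ~5 frame times
# I tile due in 3 frames (50 ms): waiting beats an 80 ms round trip.
print(should_request_i_tile(3, 80.0))    # False
```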
Video and Audio Compressor/Decompressor Implementation
FIG. 12illustrates one particular embodiment in which a multi-core and/or multi-processor1200is used to compress 8 tiles in parallel. In one embodiment, a dual processor, quad core Xeon CPU computer system running at 2.66 GHz or higher is used, with each core implementing the open source x264 H.264 compressor as an independent process. However, various other hardware/software configurations may be used while still complying with these underlying principles. For example, each of the CPU cores can be replaced with an H.264 compressor implemented in an FPGA. In the example shown inFIG. 12, cores1201-1208are used to concurrently process the I tiles and P tiles as eight independent threads. As is well known in the art, current multi-core and multi-processor computer systems are inherently capable of multi-threading when integrated with multi-threading operating systems such as Microsoft Windows XP Professional Edition (either 64-bit or the 32-bit edition) and Linux.
In the embodiment illustrated inFIG. 12, since each of the 8 cores is responsible for just one tile, it operates largely independently from the other cores, each running a separate instantiation of x264. A PCI Express x1-based DVI capture card, such as the Sendero Video Imaging IP Development Board from Microtronix of Oosterhout, The Netherlands, is used to capture uncompressed video at 640×480, 800×600, or 1280×720 resolution, and the FPGA on the card uses Direct Memory Access (DMA) to transfer the captured video through the DVI bus into system RAM. The tiles are arranged in a 4×2 arrangement1205(although they are illustrated as square tiles, in this embodiment they are of 160×240 resolution). Each instantiation of x264 is configured to compress one of the 8 160×240 tiles, and they are synchronized such that, after an initial I tile compression, each core enters into a cycle, each one frame out of phase with the others, to compress one I tile followed by seven P tiles, as illustrated inFIG. 12.
Each frame time, the resulting compressed tiles are combined into a packet stream, using the techniques previously described, and then the compressed tiles are transmitted to a destination client415.
Although not illustrated inFIG. 12, if the data rate of the combined 8 tiles exceeds a specified peak data rate941, then all 8 x264 processes are suspended for as many frame times as are necessary until the data for the combined 8 tiles has been transmitted.
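In outline, theFIG. 12pipeline is eight per-tile compression processes, each one frame out of phase in its I tile duty. The sketch below uses a process pool and a stand-in compress_tile function in place of the x264 instantiations; it models only the scheduling, not actual H.264 compression.

```python
# Scheduling model of the FIG. 12 pipeline: 8 workers, one tile each, with
# the I tile duty rotating one frame out of phase across the workers.

from multiprocessing import Pool

CYCLE = 8   # 8 tiles in a 4x2 arrangement

def compress_tile(args):
    tile_index, frame_number, pixels = args
    # Exactly one tile per frame is I-coded; the rest are P-coded.
    kind = "I" if frame_number % CYCLE == tile_index else "P"
    return tile_index, kind, len(pixels)   # stand-in for compressed bytes

if __name__ == "__main__":
    fake_tile = bytes(160 * 240)            # placeholder for captured pixels
    with Pool(processes=CYCLE) as pool:
        for frame in range(3):
            work = [(t, frame, fake_tile) for t in range(CYCLE)]
            results = pool.map(compress_tile, work)
            print(frame, [kind for _, kind, _ in sorted(results)])
# frame 0: tile 0 is the I tile; frame 1: tile 1; and so on around the cycle.
```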
In one embodiment, client415is implemented as software on a PC running 8 instantiations of FFmpeg. A receiving process receives the 8 tiles, and each tile is routed to an FFmpeg instantiation, which decompresses the tile and renders it to the appropriate tile location on the display device422.
The client415receives keyboard, mouse, or game controller input from the PC's input device drivers and transmits it to the server402. The server402then applies the received input device data to the game or application running on the server402, which is a PC running Windows using an Intel 2.16 GHz Core Duo CPU. The server402then produces a new frame and outputs it through its DVI output, either from a motherboard-based graphics system, or through an NVIDIA 8800GTX PCI Express card's DVI output.
Simultaneously, the server402outputs the audio produced by the game or application through its digital audio output (e.g., S/PDIF), which is coupled to the digital audio input on the dual quad-core Xeon-based PC that is implementing the video compression. A Vorbis open source audio compressor is used to compress the audio simultaneously with the video using whatever core is available for the process thread. In one embodiment, the core that completes compressing its tile first executes the audio compression. The compressed audio is then transmitted along with the compressed video, and is decompressed on the client415using a Vorbis audio decompressor.
Hosting Service Server Center Distribution
Light through glass, such as optical fiber, travels at some fraction of the speed of light in a vacuum, and so an exact propagation speed for light in optical fiber can be determined. But, in practice, allowing time for routing delays, transmission inefficiencies, and other overhead, we have observed that optimal latencies on the Internet reflect transmission speeds closer to 50% the speed of light. Thus, an optimal 1000 mile round trip latency is approximately 22 ms, and an optimal 3000 mile round trip latency is about 64 ms. Thus, a single server on one US coast will be too far away to serve clients on the other coast (which can be as far as 3000 miles away) with the desired latency. However, as illustrated inFIG. 13a, if the hosting service210server center1300is located in the center of the US (e.g., Kansas, Nebraska, etc.), such that the distance to any point in the continental US is approximately 1500 miles or less, the round trip Internet latency could be as low as 32 ms. Referring toFIG. 4b, note that although the worst-case latency allowed for the user ISP453is 25 ms, typically, we have observed latencies closer to 10-15 ms with DSL and cable modem systems. Also,FIG. 4bassumes a maximum distance from the user premises211to the hosting center210of 1000 miles. Thus, with a typical user ISP round trip latency of 15 ms and a maximum Internet distance of 1500 miles for a round trip latency of 32 ms, the total round trip latency from the point a user actuates input device421and sees a response on display device422is 1+1+15+32+1+16+6+8=80 ms. So, the 80 ms response time can typically be achieved over an Internet distance of 1500 miles. This would allow any user premises with a short enough user ISP latency453in the continental US to access a single server center that is centrally located.
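The latency budget above can be tabulated directly. The stage labels below are paraphrases of theFIG. 4bstages (grouping the 16 ms frame computation and the 6 ms compression into one entry is our choice); the Internet round trip uses the ~50%-of-light-speed rule of thumb from the text.

```python
# Worked latency budget (milliseconds), reproducing 1+1+15+32+1+16+6+8 = 80.

SPEED_OF_LIGHT_MI_PER_MS = 186.3          # miles per millisecond in vacuum

def internet_rtt_ms(one_way_miles: float, fraction_of_c: float = 0.5) -> float:
    return 2 * one_way_miles / (SPEED_OF_LIGHT_MI_PER_MS * fraction_of_c)

budget = {
    "input device -> client": 1,
    "client processing": 1,
    "user ISP round trip": 15,
    "Internet round trip (1500 mi)": round(internet_rtt_ms(1500)),  # ~32
    "server center routing": 1,
    "compute frame + compress": 16 + 6,
    "decompress + display": 8,
}
print(sum(budget.values()), "ms total")   # -> 80 ms total
```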
In another embodiment, illustrated inFIG. 13b, the hosting service210server centers, HS1-HS6, are strategically positioned around the United States (or other geographical region), with certain larger hosting service server centers positioned close to high population centers (e.g., HS2and HS5). In one embodiment, the server centers HS1-HS6exchange information via a network1301which may be the Internet or a private network or a combination of both. With multiple server centers, services can be provided at lower latency to users that have high user ISP latency453.
Although distance on the Internet is certainly a factor that contributes to round trip latency through the Internet, sometimes other factors come into play that are largely unrelated to distance. Sometimes a packet stream is routed through the Internet to a far away location and back again, resulting in latency from the long loop. Sometimes there is routing equipment on the path that is not operating properly, resulting in a delay of the transmission. Sometimes there is traffic overloading a path, which introduces delay. And sometimes there is a failure that prevents the user's ISP from routing to a given destination at all. Thus, while the general Internet usually provides connections from one point to another with a fairly reliable and optimal route and a latency that is largely determined by distance (especially with long distance connections that result in routing outside of the user's local area), such reliability and latency is by no means guaranteed and often cannot be achieved from a user's premises to a given destination on the general Internet.
In one embodiment, when a user client415initially connects to the hosting service210to play a video game or use an application, the client communicates with each of the available hosting service server centers HS1-HS6upon startup (e.g., using the techniques described above). If the latency is low enough for a particular connection, then that connection is used. In one embodiment, the client communicates with all, or a subset, of the hosting service server centers and the one with the lowest latency connection is selected. The client may select the service center with the lowest latency connection, or the service centers may identify the one with the lowest latency connection and provide this information (e.g., in the form of an Internet address) to the client.
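The startup selection amounts to probing each server center and keeping the minimum. A trivial sketch, with probe() standing in for the latency measurement described earlier and the example latencies invented for illustration:

```python
# Pick the server center with the lowest measured round-trip latency.

def pick_server_center(centers, probe):
    """centers: iterable of center ids; probe(id) -> round-trip ms."""
    return min(centers, key=probe)

latencies = {"HS1": 55, "HS2": 21, "HS3": 34, "HS4": 70, "HS5": 28, "HS6": 90}
print(pick_server_center(latencies, latencies.get))   # -> HS2
```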
If a particular hosting service server center is overloaded and/or the user's game or application can tolerate the latency to another, less loaded hosting service server center, then the client415may be redirected to the other hosting service server center. In such a situation, the game or application the user is running would be paused on the server402at the user's overloaded server center, and the game or application state data would be transferred to a server402at the other hosting service server center. The game or application would then be resumed. In one embodiment, the hosting service210would wait until the game or application has reached a natural pausing point (e.g., between levels in a game, or after the user initiates a "save" operation in an application) to do the transfer. In yet another embodiment, the hosting service210would wait until user activity ceases for a specified period of time (e.g., 1 minute) and then would initiate the transfer at that time.
As described above, in one embodiment, the hosting service210subscribes to an Internet bypass service440ofFIG. 14to attempt to provide guaranteed latency to its clients. Internet bypass services, as used herein, are services that provide private network routes from one point to another on the Internet with guaranteed characteristics (e.g., latency, data rate, etc.). For example, if the hosting service210were receiving a large amount of traffic from users using AT&T's DSL service offering in San Francisco, rather than routing to AT&T's San Francisco-based central offices, the hosting service210could lease a high-capacity private data connection from a service provider (perhaps AT&T itself or another provider) between the San Francisco-based central offices and one or more of the server centers for hosting service210. Then, if routes from all hosting service server centers HS1-HS6through the general Internet to a user in San Francisco using AT&T DSL result in too high latency, the private data connection could be used instead. Although private data connections are generally more expensive than routes through the general Internet, so long as they remain a small percentage of the hosting service210connections to users, the overall cost impact will be low, and users will experience a more consistent service experience.
Server centers often have two layers of backup power in the event of power failure. The first layer typically is backup power from batteries (or from an alternative immediately available energy source, such as a flywheel that is kept running and is attached to a generator), which provides power immediately when the power mains fail and keeps the server center running. If the power failure is brief, and the power mains return quickly (e.g., within a minute), then the batteries are all that is needed to keep the server center running. But if the power failure is for a longer period of time, then typically generators (e.g., diesel-powered) are started up that take over for the batteries and can run for as long as they have fuel. Such generators are extremely expensive since they must be capable of producing as much power as the server center normally gets from the power mains.
In one embodiment, the hosting service server centers HS1-HS6share user data with one another so that if one server center has a power failure, it can pause the games and applications that are in process, transfer the game or application state data from each server402to servers402at other server centers, and then notify the client415of each user to direct its communications to the new server402. Given that such situations occur infrequently, it may be acceptable to transfer a user to a hosting service server center which is not able to provide optimal latency (i.e., the user will simply have to tolerate higher latency for the duration of the power failure), which will allow for a much wider range of options for transferring users. For example, given the time zone differences across the US, users on the East Coast may be going to sleep at 11:30 PM while users on the West Coast, at 8:30 PM, are starting to peak in video game usage. If there is a power failure in a hosting service server center on the West Coast at that time, there may not be enough West Coast servers402at other hosting service server centers to handle all of the users. In such a situation, some of the users can be transferred to hosting service server centers on the East Coast which have available servers402, and the only consequence to the users would be higher latency. Once the users have been transferred from the server center that has lost power, the server center can then commence an orderly shutdown of its servers and equipment, such that all of the equipment has been shut down before the batteries (or other immediate power backup) are exhausted. In this way, the cost of a generator for the server center can be avoided.
In one embodiment, during times of heavy loading of the hosting service210(either due to peak user loading, or because one or more server centers have failed) users are transferred to other server centers on the basis of the latency requirements of the game or application they are using. So, users using games or applications that require low latency would be given preference to available low latency server connections when there is a limited supply.
Hosting Service Features
FIG. 15illustrates an embodiment of components of a server center for hosting service210utilized in the following feature descriptions. As with the hosting service210illustrated inFIG. 2a, the components of this server center are controlled and coordinated by a hosting service210control system401unless otherwise qualified.
Inbound internet traffic1501from user clients415is directed to inbound routing1502. Typically, inbound internet traffic1501will enter the server center via a high-speed fiber optic connection to the Internet, but any network connection means of adequate bandwidth, reliability and low latency will suffice. Inbound routing1502is a system of network switches, and routing servers supporting the switches, which takes the arriving packets and routes each packet to the appropriate application/game ("app/game") server1521-1525(the network can be implemented as an Ethernet network, a fiber channel network, or through any other transport means). In one embodiment, a packet which is delivered to a particular app/game server represents a subset of the data received from the client and/or may be translated/changed by other components (e.g., networking components such as gateways and routers) within the data center. In some cases, packets will be routed to more than one server1521-1525at a time, for example, if a game or application is running on multiple servers at once in parallel. RAID arrays1511-1512are connected to the inbound routing network1502, such that the app/game servers1521-1525can read and write to the RAID arrays1511-1512. Further, a RAID array1515(which may be implemented as multiple RAID arrays) is also connected to the inbound routing1502and data from RAID array1515can be read by app/game servers1521-1525. The inbound routing1502may be implemented in a wide range of prior art network architectures, including a tree structure of switches, with the inbound internet traffic1501at its root; in a mesh structure interconnecting all of the various devices; or as an interconnected series of subnets, with concentrated traffic amongst intercommunicating devices segregated from concentrated traffic amongst other devices. One type of network configuration is a SAN which, although typically used for storage devices, can also be used for general high-speed data transfer among devices. Also, the app/game servers1521-1525may each have multiple network connections to the inbound routing1502. For example, a server1521-1525may have a network connection to a subnet attached to RAID Arrays1511-1512and another network connection to a subnet attached to other devices.
The app/game servers1521-1525may all be configured the same, some differently, or all differently, as previously described in relation to servers402in the embodiment illustrated inFIG. 4a. In one embodiment, each user, when using the hosting service, is typically using at least one app/game server1521-1525. For the sake of simplicity of explanation, we shall assume a given user is using app/game server1521, but multiple servers could be used by one user, and multiple users could share a single app/game server1521-1525. The user's control input, sent from client415as previously described, is received as inbound Internet traffic1501, and is routed through inbound routing1502to app/game server1521. App/game server1521uses the user's control input as control input to the game or application running on the server, and computes the next frame of video and the audio associated with it. App/game server1521then outputs the uncompressed video/audio1529to shared video compression1530. The app/game server may output the uncompressed video via any means, including one or more Gigabit Ethernet connections, but in one embodiment the video is output via a DVI connection and the audio and other compression and communication channel state information is output via a Universal Serial Bus (USB) connection.
The shared video compression 1530 compresses the uncompressed video and audio from the app/game servers 1521-1525. The compression may be implemented entirely in hardware, or in hardware running software. There may be a dedicated compressor for each app/game server 1521-1525, or, if the compressors are fast enough, a given compressor can be used to compress the video/audio from more than one app/game server 1521-1525. For example, at 60 fps a video frame time is 16.67 ms. If a compressor is able to compress a frame in 1 ms, then that compressor could be used to compress the video/audio from as many as 16 app/game servers 1521-1525 by taking input from one server after another, with the compressor saving the state of each video/audio compression process and switching context as it cycles amongst the video/audio streams from the servers. This results in substantial cost savings in compression hardware. Since different servers will be completing frames at different times, in one embodiment the compressor resources are in a shared pool 1530 with shared storage means (e.g., RAM, Flash) for storing the state of each compression process, and when a server 1521-1525 frame is complete and ready to be compressed, a control means determines which compression resource is available at that time and provides that resource with the state of the server's compression process and the frame of uncompressed video/audio to compress.
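The pooled-compressor bookkeeping just described can be sketched in a few lines. The following is only an illustration of the save-state/restore-state cycle, not the patented implementation; the class, method names, and the stub encode() function are all assumptions introduced here.

    from queue import Queue

    def encode(compressor_id, raw_frame, state):
        """Placeholder encoder: returns (compressed bytes, updated state)."""
        new_state = (state or 0) + 1   # e.g., a frame counter standing in for reference data
        return b"P-frame:" + raw_frame[:8], new_state

    class SharedCompressorPool:
        def __init__(self, num_compressors):
            self.idle = Queue()
            for i in range(num_compressors):
                self.idle.put(i)       # compressor IDs available for work
            self.states = {}           # saved compression state, one entry per stream

        def compress_frame(self, stream_id, raw_frame):
            comp = self.idle.get()     # wait for an available compressor
            try:
                state = self.states.get(stream_id)    # restore this stream's context
                compressed, state = encode(comp, raw_frame, state)
                self.states[stream_id] = state        # save context for the next frame
                return compressed
            finally:
                self.idle.put(comp)    # return the compressor to the shared pool

At 60 fps a frame time is 16.67 ms, so if encode() took about 1 ms, one compressor in the pool could be time-shared across roughly 16 such streams, as the paragraph above notes.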
Note that part of the state for each server's compression process includes information about the compression itself, such as the previous frame's decompressed frame buffer data (which may be used as a reference for P tiles), the resolution of the video output, the quality of the compression, the tiling structure, the allocation of bits per tile, and the audio format (e.g., stereo, surround sound, Dolby® AC-3). But the compression process state also includes communication channel state information regarding the peak data rate 941 and whether a previous frame (as illustrated in FIG. 9b) is currently being output (in which case the current frame should be ignored), and potentially whether there are channel characteristics, such as excessive packet loss, which affect decisions for the compression (e.g., in terms of the frequency of I tiles, etc.). As the peak data rate 941 or other channel characteristics change over time, as determined by the app/game server 1521-1525 supporting each user monitoring data sent from the client 415, the app/game server 1521-1525 sends the relevant information to the shared hardware compression 1530.
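The state fields enumerated above might be grouped roughly as follows. This is a hypothetical sketch; every field name and default is an assumption rather than terminology from the specification.

    from dataclasses import dataclass, field

    @dataclass
    class CompressionProcessState:
        # Encoder-side state
        reference_frame: bytes = b""          # previous decompressed frame, for P tiles
        resolution: tuple = (1280, 720)
        quality: int = 30                     # e.g., a quantizer level
        tile_bit_allocation: list = field(default_factory=list)  # bits per tile
        audio_format: str = "stereo"          # "stereo", "surround", "AC-3", ...
        # Communication-channel state
        peak_data_rate_bps: int = 5_000_000   # current peak data rate 941 for this client
        prev_frame_in_flight: bool = False    # if True, ignore the current frame
        packet_loss_excessive: bool = False   # if True, raise the frequency of I tiles

Keeping the encoder state and channel state together in one record is what lets any pooled compressor resume a stream mid-session.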
The shared hardware compression 1530 also packetizes the compressed video/audio using means such as those previously described and, if appropriate, applies FEC codes, duplicates certain data, or takes other steps so as to adequately ensure the ability of the video/audio data stream to be received by the client 415 and decompressed with as high a quality and reliability as feasible.
Some applications, such as those described below, require the video/audio output of a given app/game server 1521-1525 to be available at multiple resolutions (or in other multiple formats) simultaneously. If the app/game server 1521-1525 so notifies the shared hardware compression 1530 resource, then the uncompressed video/audio 1529 of that app/game server 1521-1525 will be simultaneously compressed in different formats, at different resolutions, and/or in different packet/error correction structures. In some cases, some compression resources can be shared amongst multiple compression processes compressing the same video/audio (e.g., in many compression algorithms there is a step whereby the image is scaled to multiple sizes before applying compression; if different size images are required to be output, then this step can be used to serve several compression processes at once). In other cases, separate compression resources will be required for each format. In any case, the compressed video/audio 1539 of all of the various resolutions and formats required for a given app/game server 1521-1525 (be it one or many) will be output at once to outbound routing 1540. In one embodiment the output of the compressed video/audio 1539 is in UDP format, so it is a unidirectional stream of packets.
The outbound routing network 1540 comprises a series of routing servers and switches which direct each compressed video/audio stream to the intended user(s) or other destinations through the outbound Internet traffic 1599 interface (which typically would connect to a fiber interface to the Internet) and/or back to the delay buffer 1515, and/or back to the inbound routing 1502, and/or out through a private network (not shown) for video distribution. Note that (as described below) the outbound routing 1540 may output a given video/audio stream to multiple destinations at once. In one embodiment this is implemented using Internet Protocol (IP) multicast, in which a given UDP stream intended to be streamed to multiple destinations at once is broadcast, and the broadcast is repeated by the routing servers and switches in the outbound routing 1540. The multiple destinations of the broadcast may be multiple users' clients 415 via the Internet, multiple app/game servers 1521-1525 via inbound routing 1502, and/or one or more delay buffers 1515. Thus, the output of a given server 1521-1525 is compressed into one or multiple formats, and each compressed stream is directed to one or multiple destinations.
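For illustration only, the fan-out might be modeled as below. A real deployment would rely on IP multicast repeated by the switches themselves; this sketch simply duplicates each packet to a subscriber list, and the function and parameter names are assumptions.

    import socket

    def fan_out(packet: bytes, destinations: list[tuple[str, int]]) -> None:
        """Send one compressed video/audio packet to every subscribed destination."""
        sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
        try:
            for addr in destinations:
                sock.sendto(packet, addr)   # unidirectional UDP, no acknowledgement
        finally:
            sock.close()

The destination list for one stream could mix clients on the Internet, app/game servers reached via inbound routing, and one or more delay buffers, mirroring the three destination classes named above.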
Further, in another embodiment, if multiple app/game servers 1521-1525 are used simultaneously by one user (e.g., in a parallel processing configuration to create the 3D output of a complex scene) and each server is producing part of the resulting image, the video output of the multiple servers 1521-1525 can be combined by the shared hardware compression 1530 into a combined frame, and from that point forward it is handled as described above as if it came from a single app/game server 1521-1525.
Note that in one embodiment, a copy (at the resolution of the video viewed by the user, or higher) of all video generated by app/game servers 1521-1525 is recorded in delay buffer 1515 for at least some number of minutes (15 minutes in one embodiment). This allows each user to "rewind" the video from each session in order to review previous work or exploits (in the case of a game). Thus, in one embodiment, each compressed video/audio output 1539 stream being routed to a user client 415 is also being multicast to a delay buffer 1515. When the video/audio is stored on a delay buffer 1515, a directory on the delay buffer 1515 provides a cross-reference between the network address of the app/game server 1521-1525 that is the source of the delayed video/audio and the location on the delay buffer 1515 where the delayed video/audio can be found.
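A rolling buffer plus directory of this kind might look like the following sketch: a hypothetical structure assuming one bounded deque of compressed frames per source server, with none of the names taken from the specification.

    from collections import deque

    class DelayBuffer:
        def __init__(self, minutes=15, fps=60):
            self.capacity = minutes * 60 * fps
            self.streams = {}     # source address -> deque of recent compressed frames
            self.directory = {}   # source address -> location of its delayed stream

        def record(self, source_addr, compressed_frame):
            buf = self.streams.setdefault(source_addr, deque(maxlen=self.capacity))
            buf.append(compressed_frame)          # oldest frames fall off automatically
            self.directory[source_addr] = source_addr   # cross-reference entry

        def rewind(self, source_addr, frames_back):
            """Fetch the frame from `frames_back` frames ago for DVR-style review."""
            buf = self.streams[self.directory[source_addr]]
            return buf[-frames_back]

The bounded deque gives the "at least some number of minutes" retention described above; the directory is what lets another server locate a delayed stream knowing only the source server's network address.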
Live, Instantly-Viewable, Instantly-Playable Games
App/game servers 1521-1525 may not only be used for running a given application or video game for a user, but they may also be used for creating the user interface applications for the hosting service 210 that support navigation through hosting service 210 and other features. A screen shot of one such user interface application is shown in FIG. 16, a "Game Finder" screen. This particular user interface screen allows a user to watch 15 games that are being played live (or delayed) by other users. Each of the "thumbnail" video windows, such as 1600, is a live video window in motion showing the video from one user's game. The view shown in the thumbnail may be the same view that the user is seeing, or it may be a delayed view (e.g., if a user is playing a combat game, a user may not want other users to see where she is hiding and she may choose to delay any view of her gameplay by a period of time, say 10 minutes). The view may also be a camera view of a game that is different from any user's view. Through menu selections (not shown in this illustration), a user may choose a selection of games to view at once, based on a variety of criteria. As a small sampling of exemplary choices, the user may select a random selection of games (such as those shown in FIG. 16), all of one kind of game (all being played by different players), only the top-ranked players of a game, players at a given level in the game, or lower-ranked players (e.g., if the player is learning the basics), players who are "buddies" (or are rivals), the games that have the largest number of viewers, etc.
Note that generally, each user will decide whether the video from his or her game or application can be viewed by others and, if so, which others may view it, when it may be viewed, and whether it is viewable only with a delay.
The app/game server 1521-1525 that is generating the user interface screen shown in FIG. 16 acquires the 15 video/audio feeds by sending a message to the app/game server 1521-1525 for each user whose game it is requesting. The message is sent through the inbound routing 1502 or another network. The message will include the size and format of the video/audio requested, and will identify the user viewing the user interface screen. A given user may choose to select "privacy" mode and not permit any other users to view video/audio of his game (either from his point of view or from another point of view), or, as described in the previous paragraph, a user may choose to allow viewing of video/audio from her game, but delay the video/audio viewed. A user app/game server 1521-1525 receiving and accepting a request to allow its video/audio to be viewed will acknowledge as such to the requesting server, and it will also notify the shared hardware compression 1530 of the need to generate an additional compressed video stream in the requested format or screen size (assuming the format and screen size are different than ones already being generated), and it will also indicate the destination for the compressed video (i.e., the requesting server). If the requested video/audio is only delayed, then the requesting app/game server 1521-1525 will be so notified, and it will acquire the delayed video/audio from a delay buffer 1515 by looking up the video/audio's location in the directory on the delay buffer 1515 via the network address of the app/game server 1521-1525 that is the source of the delayed video/audio. Once all of these requests have been generated and handled, up to 15 live thumbnail-sized video streams will be routed from the outbound routing 1540 to the inbound routing 1502 to the app/game server 1521-1525 generating the user interface screen, and will be decompressed and displayed by the server. Delayed video/audio streams may be in too large a screen size, and if so, the app/game server 1521-1525 will decompress the streams and scale the video streams down to thumbnail size. In one embodiment, requests for audio/video are sent to (and managed by) a central "management" service similar to the hosting service control system of FIG. 4a (not shown in FIG. 15), which then redirects the requests to the appropriate app/game server 1521-1525. Moreover, in one embodiment, no request may be required because the thumbnails are "pushed" to the clients of those users that allow it.
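As a sketch of that request/accept exchange: the message fields track the description above, but the JSON wire format and every helper name below are assumptions introduced purely for illustration.

    import json

    def make_feed_request(viewer_id, width, height, fmt, reply_to):
        """Build a video/audio feed request identifying size, format, and viewer."""
        return json.dumps({
            "type": "feed_request",
            "viewer": viewer_id,       # who is asking, so privacy rules can be applied
            "size": [width, height],   # e.g., thumbnail dimensions
            "format": fmt,             # codec/packetization wanted
            "destination": reply_to,   # requesting server, for outbound routing
        }).encode()

    def handle_feed_request(msg, privacy_mode, delay_only):
        """Source server's decision: deny, serve delayed, or serve live."""
        req = json.loads(msg)
        if privacy_mode:
            return {"status": "denied"}
        if delay_only:
            return {"status": "delayed"}   # requester fetches from the delay buffer
        # Otherwise: acknowledge, and notify shared compression to add a stream
        # in req["size"]/req["format"] routed to req["destination"].
        return {"status": "live", "stream_to": req["destination"]}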
The audio from 15 games all mixed simultaneously might create a cacophony of sound. The user may choose to mix all of the sounds together in this way (perhaps just to get a sense of the "din" created by all the action being viewed), or the user may choose to listen to the audio from just one game at a time. The selection of a single game is accomplished by moving the yellow selection box 1601 (appearing as a black rectangular outline in the black-and-white rendering of FIG. 16) to a given game (the yellow box movement can be accomplished by using arrow keys on a keyboard, by moving a mouse, by moving a joystick, or by pushing directional buttons on another device such as a mobile phone). Once a single game is selected, just the audio from that game plays. Also, game information 1602 is shown. In the case of this game, for example, the publisher logo (e.g., "EA" for "Electronic Arts") and the game logo (e.g., "Need for Speed Carbon") are shown, and an orange horizontal bar (rendered in FIG. 16 as a bar with vertical stripes) indicates in relative terms the number of people playing or viewing the game at that particular moment (many, in this case, so the game is "Hot"). Further "Stats" (i.e., statistics) are provided, indicating that there are 145 players actively playing 80 different instantiations of the Need for Speed game (i.e., it can be played either as an individual-player game or as a multiplayer game), and there are 680 viewers (of which this user is one). Note that these statistics (and other statistics) are collected by hosting service control system 401 and are stored on RAID arrays 1511-1512, for keeping logs of the hosting service 210 operation and for appropriately billing users and paying publishers who provide content. Some of the statistics are recorded due to actions by the service control system 401, and some are reported to the service control system 401 by the individual app/game servers 1521-1525. For example, the app/game server 1521-1525 running this Game Finder application sends messages to the hosting service control system 401 when games are being viewed (and when they cease to be viewed) so that it may update the statistics of how many games are in view. Some of the statistics are available for user interface applications such as this Game Finder application.
If the user clicks an activation button on their input device, they will see the thumbnail video in the yellow box zoom up, while continuing to play live video, to full screen size. This effect is shown in process in FIG. 17. Note that video window 1700 has grown in size. To implement this effect, the app/game server 1521-1525 requests from the app/game server 1521-1525 running the selected game that a copy of the video stream for a full screen size (at the resolution of the user's display device 422) of the game be routed to it. The app/game server 1521-1525 running the game notifies the shared hardware compressor 1530 that a thumbnail-sized copy of the game is no longer needed (unless another app/game server 1521-1525 requires such a thumbnail), and then directs it to send a full-screen-size copy of the video to the app/game server 1521-1525 zooming the video. The user playing the game may or may not have a display device 422 that is the same resolution as that of the user zooming up the game. Further, other viewers of the game may or may not have display devices 422 that are the same resolution as the user zooming up the game (and may have different audio playback means, e.g., stereo or surround sound). Thus, the shared hardware compressor 1530 determines whether a suitable compressed video/audio stream is already being generated that meets the requirements of the user requesting the video/audio stream; if one does exist, it notifies the outbound routing 1540 to route a copy of the stream to the app/game server 1521-1525 zooming the video, and if not, it compresses another copy of the video that is suitable for that user and instructs the outbound routing to send the stream back to the inbound routing 1502 and the app/game server 1521-1525 zooming the video. This server, now receiving a full-screen version of the selected video, will decompress it and gradually scale it up to full size.
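That reuse-or-compress decision reduces to a simple lookup. The sketch below is an assumed simplification; the stream records, callback parameters, and field names are all hypothetical.

    def attach_viewer(active_streams, request, start_compression, route_copy):
        """active_streams: list of dicts like {"resolution": ..., "format": ...}."""
        for stream in active_streams:
            if (stream["resolution"] == request["resolution"]
                    and stream["format"] == request["format"]):
                route_copy(stream, request["destination"])   # reuse an existing stream
                return stream
        # No matching stream exists: start a new compression process for this viewer.
        new_stream = start_compression(request["resolution"], request["format"])
        route_copy(new_stream, request["destination"])
        active_streams.append(new_stream)
        return new_stream

Reusing an existing stream costs only routing, which is why viewers with matching display resolutions are cheap to add.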
FIG. 18 illustrates how the screen looks after the game has completely zoomed up to full screen and the game is shown at the full resolution of the user's display device 422, as indicated by the image pointed to by arrow 1800. The app/game server 1521-1525 running the Game Finder application sends messages to the other app/game servers 1521-1525 that had been providing thumbnails indicating that they are no longer needed, and messages to the hosting service control server 401 indicating that the other games are no longer being viewed. At this point, the only display it is generating is an overlay 1801 at the top of the screen which provides information and menu controls to the user. Note that as this game has progressed, the audience has grown to 2,503 viewers. With so many viewers, there are bound to be many viewers with display devices 422 that have the same or nearly the same resolution (each app/game server 1521-1525 has the ability to scale the video to adjust the fit).
Because the game shown is a multiplayer game, the user may decide to join the game at some point. The hosting service 210 may or may not allow the user to join the game for a variety of reasons. For example, the user may have to pay to play the game and choose not to, the user may not have sufficient ranking to join that particular game (e.g., it would not be competitive for the other players), or the user's Internet connection may not have low enough latency to allow the user to play (e.g., there is not a latency constraint for viewing games, so a game that is being played far away (indeed, on another continent) can be viewed without latency concerns, but for a game to be played, the latency must be low enough for the user to (a) enjoy the game, and (b) be on equal footing with the other players who may have lower latency connections). If the user is permitted to play, then the app/game server 1521-1525 that had been providing the Game Finder user interface for the user will request that the hosting service control server 401 initiate (i.e., locate and start up) an app/game server 1521-1525 that is suitably configured for playing the particular game, to load the game from a RAID array 1511-1512. Then the hosting service control server 401 will instruct the inbound routing 1502 to transfer the control signals from the user to the app/game server now hosting the game, and it will instruct the shared hardware compression 1530 to switch from compressing the video/audio from the app/game server that had been hosting the Game Finder application to compressing the video/audio from the app/game server now hosting the game. The vertical syncs of the Game Finder app/game server and the new app/game server hosting the game are not synchronized, and as a result there is likely to be a time difference between the two syncs. Because the shared video compression hardware 1530 will begin compressing video upon an app/game server 1521-1525 completing a video frame, the first frame from the new server may be completed sooner than a full frame time of the old server, which may be before the prior compressed frame has completed its transmission (e.g., consider transmit time 992 of FIG. 9b: if uncompressed frame 3 963 were completed half a frame time early, it would impinge upon the transmit time 992). In such a situation, the shared video compression hardware 1530 will ignore the first frame from the new server (e.g., as Frame 4 964 is ignored 974), the client 415 will hold the last frame from the old server an extra frame time, and the shared video compression hardware 1530 will begin compressing the next frame time of video from the new app/game server hosting the game. Visually, to the user, the transition from one app/game server to the other will be seamless. The hosting service control server 401 will then notify the app/game server 1521-1525 that had been hosting the Game Finder to switch to an idle state, until it is needed again.
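The frame-admission rule during such a handover can be captured in a few lines. The timing model below is an illustrative assumption rather than the specification's mechanism verbatim.

    FRAME_TIME_MS = 1000 / 60   # 16.67 ms at 60 fps

    def admit_first_frame(new_frame_ready_ms, old_frame_tx_end_ms):
        """Return True if the new server's first frame can be compressed now."""
        if new_frame_ready_ms < old_frame_tx_end_ms:
            return False        # ignore it; the client repeats the last old frame
        return True

    # Example: the old server's last frame transmits until t=20 ms, while the new
    # server's first frame is ready at t=12 ms, so that frame is dropped; one
    # frame time later, the next frame is admitted.
    assert admit_first_frame(12.0, 20.0) is False
    assert admit_first_frame(12.0 + FRAME_TIME_MS, 20.0) is True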
The user then is able to play the game. And, what is exceptional is that the game will play perceptually instantly (since it will have loaded onto the app/game server 1521-1525 from a RAID array 1511-1512 at gigabit/second speed), and the game will be loaded onto a server exactly suited for the game, together with an operating system exactly configured for the game with the ideal drivers, registry configuration (in the case of Windows), and with no other applications running on the server that might compete with the game's operation.
Also, as the user progresses through the game, each of the segments of the game will load into the server at gigabit/second speed (i.e., 1 gigabyte loads in 8 seconds) from the RAID array 1511-1512, and because of the vast storage capacity of the RAID array 1511-1512 (since it is a shared resource among many users, it can be very large, yet still be cost effective), geometry setup or other game segment setup can be pre-computed and stored on the RAID array 1511-1512 and loaded extremely rapidly. Moreover, because the hardware configuration and computational capabilities of each app/game server 1521-1525 are known, pixel and vertex shaders can be pre-computed.
Thus, the game will start up almost instantly, it will run in an ideal environment, and subsequent segments will load almost instantly.
But, beyond these advantages, the user will be able to view others playing the game (via the Game Finder, previously described, and other means) and both decide if the game is interesting and, if so, learn tips from watching others. And, the user will be able to demo the game instantly, without having to wait for a large download and/or installation, and the user will be able to play the game instantly, perhaps on a trial basis for a smaller fee, or on a longer-term basis. And, the user will be able to play the game on a Windows PC, a Macintosh, on a television set, at home, when traveling, and even on a mobile phone with a low enough latency wireless connection (although latency will not be an issue for just spectating). And, this can all be accomplished without ever physically owning a copy of the game.
As mentioned previously, the user can decide to not allow his gameplay to be viewable by others, to allow his game to be viewable after a delay, to allow his game to be viewable by selected users, or to allow his game to be viewable by all users. Regardless, the video/audio will be stored, in one embodiment, for 15 minutes in a delay buffer 1515, and the user will be able to "rewind" and view his prior game play, and pause, play it back slowly, fast forward, etc., just as he would be able to do had he been watching TV with a Digital Video Recorder (DVR). Although in this example the user is playing a game, the same "DVR" capability is available if the user is using an application. This can be helpful in reviewing prior work and in other applications as detailed below. Further, if the game was designed with the capability of rewinding based on utilizing game state information, such that the camera view can be changed, etc., then this "3D DVR" capability will also be supported, but it will require the game to be designed to support it. The "DVR" capability using a delay buffer 1515 will work with any game or application, limited, of course, to the video that was generated when the game or application was used, but in the case of games with 3D DVR capability, the user can control a "fly through" in 3D of a previously played segment, have the delay buffer 1515 record the resulting video, and have the game state of the game segment recorded. Thus, a particular "fly through" will be recorded as compressed video, but since the game state will also be recorded, a different fly through will be possible at a later date for the same segment of the game.
As described below, users on the hosting service 210 will each have a User Page, where they can post information about themselves and other data. Among the things that users will be able to post are video segments from game play that they have saved. For example, if the user has overcome a particularly difficult challenge in a game, the user can "rewind" to just before the spot where they had their great accomplishment in the game, and then instruct the hosting service 210 to save a video segment of some duration (e.g., 30 seconds) on the user's User Page for other users to watch. To implement this, it is simply a matter of the app/game server 1521-1525 that the user is using playing back the video stored in a delay buffer 1515 to a RAID array 1511-1512 and then indexing that video segment on the user's User Page.
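Reusing the DelayBuffer sketch given earlier, the save operation amounts to copying a window of frames out of the rolling buffer and indexing it. The storage helpers here are placeholders for the service's storage APIs, not actual interfaces from the specification.

    def save_to_raid(clip):
        return f"clip-{id(clip)}"   # placeholder: persist on a RAID array, return a key

    def add_page_entry(user_page, clip_id, views, rating):
        user_page.setdefault("clips", []).append(
            {"id": clip_id, "views": views, "rating": rating})

    def save_brag_clip(delay_buffer, source_addr, seconds, fps, user_page):
        frames = [delay_buffer.rewind(source_addr, i)        # newest -> oldest
                  for i in range(1, int(seconds * fps) + 1)]
        clip = list(reversed(frames))                        # restore play order
        clip_id = save_to_raid(clip)                         # durable copy on RAID
        add_page_entry(user_page, clip_id, views=0, rating=0)  # index on the User Page
        return clip_id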
If the game has the capability of 3D DVR, as described above, then the game state information required for the 3D DVR can also be recorded by the user and made available for the user's User Page.
In the event that a game is designed to have "spectators" (i.e., users that are able to travel through the 3D world and observe the action without participating in it) in addition to active players, then the Game Finder application will enable users to join games as spectators as well as players. From an implementation point of view, there is no difference to the hosting system 210 whether a user is a spectator or an active player. The game will be loaded onto an app/game server 1521-1525 and the user will be controlling the game (e.g., controlling a virtual camera that views into the world). The only difference will be the game experience of the user.
Multiple User Collaboration
Another feature of the hosting service 210 is the ability for multiple users to collaborate while viewing live video, even if using widely disparate devices for viewing. This is useful both when playing games and when using applications.
Many PCs and mobile phones are equipped with video cameras and have the capability to do real-time video compression, particularly when the image is small. Also, small cameras are available that can be attached to a television, and it is not difficult to implement real-time compression either in software or using one of many hardware compression devices to compress the video. Also, many PCs and all mobile phones have microphones, and headsets are available with microphones.
Such cameras and/or microphones, combined with local video/audio compression capability (particularly employing the low latency video compression techniques described herein), will enable a user to transmit video and/or audio from the user premises 211 to the hosting service 210, together with the input device control data. When such techniques are employed, then a capability illustrated in FIG. 19 is achievable: a user can have his video and audio 1900 appear on the screen within another user's game or application. This example is a multiplayer game, where teammates collaborate in a car race. A user's video/audio could be selectively viewable/hearable only by their teammates. And, since, using the techniques described above, there would be effectively no latency, the players would be able to talk or make motions to each other in real time without perceptible delay.
This video/audio integration is accomplished by having the compressed video and/or audio from a user's camera/microphone arrive as inbound Internet traffic 1501. Then the inbound routing 1502 routes the video and/or audio to the app/game servers 1521-1525 that are permitted to view/hear the video and/or audio. Then, the users of the respective app/game servers 1521-1525 that choose to use the video and/or audio decompress it and integrate it as desired to appear within the game or application, such as illustrated by 1900.
The example of FIG. 19 shows how such collaboration is used in a game, but such collaboration can be an immensely powerful tool for applications. Consider a situation where a large building is being designed for New York City by architects in Chicago for a real estate developer based in New York, but the decision involves a financial investor who is traveling and happens to be in an airport in Miami, and a decision needs to be made about certain design elements of the building in terms of how it fits in with the buildings near it, to satisfy both the investor and the real estate developer. Assume the architectural firm has a high resolution monitor with a camera attached to a PC in Chicago, the real estate developer has a laptop with a camera in New York, and the investor has a mobile phone with a camera in Miami. The architectural firm can use the hosting service 210 to host a powerful architectural design application that is capable of highly realistic 3D rendering, and it can make use of a large database of the buildings in New York City, as well as a database of the building under design. The architectural design application will execute on one, or if it requires a great deal of computational power, on several, of the app/game servers 1521-1525. Each of the 3 users at disparate locations will connect to the hosting service 210, and each will have a simultaneous view of the video output of the architectural design application, but it will be appropriately sized by the shared hardware compression 1530 for the given device and network connection characteristics that each user has (e.g., the architectural firm may see a 2560×1440 60 fps display through a 20 Mbps commercial Internet connection, the real estate developer in New York may see a 1280×720 60 fps image over a 6 Mbps DSL connection on his laptop, and the investor may see a 320×180 60 fps image over a 250 Kbps cellular data connection on her mobile phone). Each party will hear the voice of the other parties (the conference calling will be handled by any of many widely available conference calling software packages in the app/game server(s) 1521-1525) and, through actuation of a button on a user input device, a user will be able to make video of themselves appear using their local camera. As the meeting proceeds, the architects will be able to show what the building looks like as they rotate it and fly by it next to the other buildings in the area, with extremely photorealistic 3D rendering, and the same video will be visible to all parties, at the resolution of each party's display device. It won't matter that none of the local devices used by any party is capable of handling the 3D animation with such realism, let alone downloading or even storing the vast database required to render the surrounding buildings in New York City. From the point of view of each of the users, despite the distance apart and despite the disparate local devices, they will simply have a seamless experience with an incredible degree of realism. And, when one party wants their face to be seen to better convey their emotional state, they can do so. Further, if either the real estate developer or the investor wants to take control of the architectural program and use their own input device (be it a keyboard, mouse, keypad or touch screen), they can, and it will respond with no perceptual latency (assuming their network connection does not have unreasonable latency).
For example, in the case of the mobile phone, if the mobile phone is connected to a WiFi network at the airport, it will have very low latency. But if it is using the cellular data networks available today in the US, it probably will suffer from a noticeable lag. Still, for most of the purposes of the meeting, where the investor is watching the architects control the building fly-by or is talking via video teleconferencing, even cellular latency should be acceptable.
Finally, at the end of the collaborative conference call, the real estate developer and the investor will have made their comments and signed off from the hosting service, and the architectural firm will be able to "rewind" the video of the conference that has been recorded on a delay buffer 1515 and review the comments, facial expressions and/or actions applied to the 3D model of the building made during the meeting. If there are particular segments they want to save, those segments of video/audio can be moved from delay buffer 1515 to a RAID array 1511-1512 for archival storage and later playback.
Also, from a cost perspective, if the architects only need to use the computation power and the large database of New York City for a 15 minute conference call, they need only pay for the time that the resources are used, rather than having to own high powered workstations and having to purchase an expensive copy of a large database.
Video-Rich Community Services
The hosting service 210 enables an unprecedented opportunity for establishing video-rich community services on the Internet. FIG. 20 shows an exemplary User Page for a game player on the hosting service 210. As with the Game Finder application, the User Page is an application that runs on one of the app/game servers 1521-1525. All of the thumbnails and video windows on this page show constantly moving video (if the segments are short, they loop).
Using a video camera or by uploading video, the user (whose username is "KILLHAZARD") is able to post a video of himself 2000 that other users can view. The video is stored on a RAID array 1511-1512. Also, when other users come to KILLHAZARD's User Page, if KILLHAZARD is using the hosting service 210 at the time, live video 2001 of whatever he is doing (assuming he permits users viewing his User Page to watch him) will be shown. This is accomplished by the app/game server 1521-1525 hosting the User Page application requesting from the service control system 401 whether KILLHAZARD is active and, if so, the app/game server 1521-1525 he is using. Then, using the same methods used by the Game Finder application, a compressed video stream in a suitable resolution and format will be sent to the app/game server 1521-1525 running the User Page application, and it will be displayed. If a user selects the window with KILLHAZARD's live gameplay, and then appropriately clicks on their input device, the window will zoom up (again using the same methods as the Game Finder application), and the live video will fill the screen, at the resolution of the watching user's display device 422, appropriate for the characteristics of the watching user's Internet connection.
A key advantage of this over prior art approaches is that the user viewing the User Page is able to see a game played live that the user does not own, and may very well not have a local computer or game console capable of playing. It offers a great opportunity for the user to see the user shown in the User Page "in action" playing games, and it is an opportunity to learn about a game that the viewing user might want to try or get better at.
Camera-recorded or uploaded video clips from KILLHAZARD's buddies 2002 are also shown on the User Page, and underneath each video clip is text that indicates whether the buddy is online playing a game (e.g., six_shot is playing the game "Eragon" (shown here as Game4) and MrSnuggles99 is Offline, etc.). By clicking on a menu item (not shown), the buddy video clips switch from showing recorded or uploaded videos to live video of what the buddies who are currently playing games on the hosting service 210 are doing at that moment in their games. So, it becomes a Game Finder grouping for buddies. If a buddy's game is selected and the user clicks on it, it will zoom up to full screen, and the user will be able to watch the game played full screen live.
Again, the user viewing the buddy's game does not own a copy of the game, nor does the user have the local computing/game console resources to play the game. The game viewing is effectively instantaneous.
As previously described above, when a user plays a game on the hosting service 210, the user is able to "rewind" the game and find a video segment he wants to save, and then saves the video segment to his User Page. These are called "Brag Clips™". The video segments 2003 are all Brag Clips 2003 saved by KILLHAZARD from previous games that he has played. Number 2004 shows how many times a Brag Clip has been viewed, and when the Brag Clip is viewed, users have an opportunity to rate them, and the number of orange (shown here as black outlines) keyhole-shaped icons 2005 indicates how high the rating is. The Brag Clips 2003 loop constantly when a user views the User Page, along with the rest of the video on the page. If the user selects and clicks on one of the Brag Clips 2003, it zooms up to present the Brag Clip 2003, along with DVR controls to allow the clip to be played, paused, rewound, fast-forwarded, stepped through, etc.
The Brag Clip 2003 playback is implemented by the app/game server 1521-1525 loading the compressed video segment that was stored on a RAID array 1511-1512 when the user recorded the Brag Clip, decompressing it, and playing it back.
Brag Clips 2003 can also be "3D DVR" video segments (i.e., a game state sequence from the game that can be replayed and allows the user to change the camera viewpoint) from games that support such capability. In this case, the game state information is stored, in addition to a compressed video recording of the particular "fly through" the user made when the game segment was recorded. When the User Page is being viewed, and all of the thumbnails and video windows are constantly looping, a 3D DVR Brag Clip 2003 will constantly loop the Brag Clip 2003 that was recorded as compressed video when the user recorded the "fly through" of the game segment. But, when a user selects a 3D DVR Brag Clip 2003 and clicks on it, in addition to the DVR controls to allow the compressed video Brag Clip to be played, the user will be able to click on a button that gives them 3D DVR capability for the game segment. They will be able to control a camera "fly through" during the game segment on their own, and, if they wish (and the user who owns the User Page so allows it), they will be able to record an alternative Brag Clip "fly through" in compressed video form, which will then be available to other viewers of the User Page (either immediately, or after the owner of the User Page has a chance to review the Brag Clip).
This 3D DVR Brag Clip 2003 capability is enabled by activating, on another app/game server 1521-1525, the game that is about to replay the recorded game state information. Since the game can be activated almost instantaneously (as previously described), it is not difficult to activate it, with its play limited to the game state recorded by the Brag Clip segment, and then allow the user to do a "fly through" with a camera while recording the compressed video to a delay buffer 1515. Once the user has completed the "fly through", the game is deactivated.
From the user's point of view, activating a "fly through" with a 3D DVR Brag Clip 2003 is no more effort than controlling the DVR controls of a linear Brag Clip 2003. They may know nothing about the game or even how to play it. They are just a virtual camera operator peering into a 3D world during a game segment recorded by another.
Users will also be able to overdub onto Brag Clips their own audio that is either recorded from microphones or uploaded. In this way, Brag Clips can be used to create custom animations, using characters and actions from games. This animation technique is commonly known as "machinima".
As users progress through games, they will achieve differing skill levels. The games played will report the accomplishments to the service control system 401, and these skill levels will be shown on User Pages.
Interactive Animated Advertisements
Online advertisements have transitioned from text, to still images, to video, and now to interactive segments, typically implemented using animation thin clients like Adobe Flash. The reason animation thin clients are used is that users typically have little patience to be delayed for the privilege of having a product or service pitched to them. Also, thin clients run on very low-performance PCs and as such, the advertiser can have a high degree of confidence that the interactive ad will work properly. Unfortunately, animation thin clients such as Adobe Flash are limited in the degree of interactivity and the duration of the experience (to mitigate download time and to be operable on almost all user devices, including low-performance PCs and Macs without GPUs or high-performance CPUs).
FIG. 21 illustrates an interactive advertisement where the user is to select the exterior and interior colors of a car while the car rotates around in a showroom, while real-time ray tracing shows how the car looks. Then the user chooses an avatar to drive the car, and then the user can take the car for a drive either on a race track, or through an exotic locale such as Monaco. The user can select a larger engine, or better tires, and then can see how the changed configuration affects the ability of the car to accelerate or hold the road.
Of course, the advertisement is effectively a sophisticated 3D video game. But for such an advertisement to be playable on a PC or a video game console it would require perhaps a 100 MB download and, in the case of the PC, it might require the installation of special drivers, and might not run at all if the PC lacks adequate CPU or GPU computing capability. Thus, such advertisements are impractical in prior art configurations.
In the hosting service 210, such advertisements launch almost instantly, and run perfectly, no matter what the user's client 415 capabilities are. So, they launch more quickly than thin client interactive ads, are vastly richer in the experience, and are highly reliable.
Streaming Geometry During Real-Time Animation
RAID arrays 1511-1512 and the inbound routing 1502 can provide data rates that are so fast and with latencies so low that it is possible to design video games and applications that rely upon the RAID arrays 1511-1512 and the inbound routing 1502 to reliably deliver geometry on-the-fly in the midst of game play, or in an application during real-time animation (e.g., a fly-through with a complex database).
With prior art systems, such as the video game system shown in FIG. 1, the mass storage devices available, particularly in practical home devices, are far too slow to stream geometry in during game play except in situations where the required geometry is somewhat predictable. For example, in a driving game where there is a specified roadway, geometry for buildings that are coming into view can be reasonably well predicted, and the mass storage devices can seek in advance to the location where the upcoming geometry is located.
But in a complex scene with unpredictable changes (e.g., in a battle scene with complex characters all around), if the RAM on the PC or video game system is completely filled with geometry for the objects currently in view, and the user then suddenly turns their character around to view what is behind it, there may be a delay before the geometry can be displayed if it has not been pre-loaded into RAM.
In the hosting service 210, the RAID arrays 1511-1512 can stream data in excess of Gigabit Ethernet speed, and with a SAN network it is possible to achieve 10 gigabit/second speed over 10 Gigabit Ethernet or other network technologies. 10 gigabits/second will load a gigabyte of data in less than a second. In a 60 fps frame time (16.67 ms), approximately 170 megabits (21 MB) of data can be loaded. Rotating media, of course, even in a RAID configuration, will still incur latencies greater than a frame time, but Flash-based RAID storage will eventually be as large as rotating media RAID arrays and will not incur such high latency. In one embodiment, massive RAM write-through caching is used to provide very low latency access.
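The arithmetic behind those figures is simple to re-derive; the short calculation below does nothing more than that.

    link_bps = 10e9                   # 10 gigabit/second SAN link
    frame_time_s = 1 / 60             # 16.67 ms frame time at 60 fps
    bits_per_frame = link_bps * frame_time_s
    print(bits_per_frame / 1e6)       # ~166.7 megabits per frame time (~170)
    print(bits_per_frame / 8 / 1e6)   # ~20.8 MB per frame time (~21 MB)
    print(1e9 * 8 / link_bps)         # 0.8 s to load a gigabyte of data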
Thus, with a sufficiently high network speed and sufficiently low latency mass storage, geometry can be streamed into app/game servers 1521-1525 as fast as the CPUs and/or GPUs can process the 3D data. So, in the example given previously, where a user turns their character around suddenly and looks behind, the geometry for all of the characters behind can be loaded before the character completes the rotation, and thus, to the user, it will seem as if he or she is in a photorealistic world that is as real as live action.
As previously discussed, one of the last frontiers in photorealistic computer animation is the human face, and because of the sensitivity of the human eye to imperfections, the slightest error from a photoreal face can result in a negative reaction from the viewer. FIG. 22 shows how a live performance captured using Contour™ Reality Capture Technology (subject of co-pending applications: "Apparatus and method for capturing the motion of a performer," Ser. No. 10/942,609, filed Sep. 15, 2004; "Apparatus and method for capturing the expression of a performer," Ser. No. 10/942,413, filed Sep. 15, 2004; "Apparatus and method for improving marker identification within a motion capture system," Ser. No. 11/066,954, filed Feb. 25, 2005; "Apparatus and method for performing motion capture using shutter synchronization," Ser. No. 11/077,628, filed Mar. 10, 2005; "Apparatus and method for performing motion capture using a random pattern on capture surfaces," Ser. No. 11/255,854, filed Oct. 20, 2005; "System and method for performing motion capture using phosphor application techniques," Ser. No. 11/449,131, filed Jun. 7, 2006; "System and method for performing motion capture by strobing a fluorescent lamp," Ser. No. 11/449,043, filed Jun. 7, 2006; and "System and method for three dimensional capture of stop-motion animated characters," Ser. No. 11/449,127, filed Jun. 7, 2006, each of which is assigned to the assignee of the present CIP application) results in a very smooth captured surface, and then in a high polygon-count tracked surface (i.e., the polygon motion follows the motion of the face precisely). Finally, when the video of the live performance is mapped onto the tracked surface to produce a textured surface, a photoreal result is produced.
Although current GPU technology is able to render the number of polygons in the tracked surface and texture and light the surface in real-time, if the polygons and textures are changing every frame time (which will produce the most photoreal results) it will quickly consume all the available RAM of a modern PC or video game console.
Using the streaming geometry techniques described above, it becomes practical to continuously feed geometry into the app/game servers 1521-1525 so that they can animate photoreal faces continuously, allowing the creation of video games with faces that are almost indistinguishable from live action faces.
Integration of Linear Content with Interactive Features
Motion pictures, television programming and audio material (collectively, "linear content") are widely available to home and office users in many forms. Linear content can be acquired on physical media, like CD, DVD and Blu-ray media. It also can be recorded by DVRs from satellite and cable TV broadcasts. And, it is available as pay-per-view (PPV) content through satellite and cable TV and as video-on-demand (VOD) on cable TV.
Increasingly, linear content is available through the Internet, both as downloaded and as streaming content. Today, there really is not one place to go to experience all of the features associated with linear media. For example, DVDs and other video optical media typically have interactive features not available elsewhere, like director's commentaries, "making of" featurettes, etc. Online music sites have cover art and song information generally not available on CDs, but not all CDs are available online. And Web sites associated with television programming often have extra features, blogs and sometimes comments from the actors or creative staff.
Further, with many motion pictures or sports events, there are often video games that are released together with the linear media (in the case of motion pictures) or that may be closely tied to real-world events (in the case of sports), such as the trading of players.
Hosting service 210 is well suited for the delivery of linear content and for linking together the disparate forms of related content. Certainly, delivering motion pictures is no more challenging than delivering highly interactive video games, and the hosting service 210 is able to deliver linear content to a wide range of devices in the home or office, or to mobile devices. FIG. 23 shows an exemplary user interface page for hosting service 210 that shows a selection of linear content.
But, unlike most linear content delivery systems, hosting service 210 is also able to deliver related interactive components (e.g., the menus and features on DVDs, the interactive overlays on HD-DVDs, and the Adobe Flash animation (as explained below) on Web sites). Thus, the limitations of the client device 415 no longer restrict which features are available.
Further, the hosting system 210 is able to link together linear content with video game content dynamically, and in real-time. For example, if a user is watching a Quidditch match in a Harry Potter movie and decides she would like to try playing Quidditch, she can just click a button and the movie will pause and immediately she will be transported to the Quidditch segment of a Harry Potter video game. After playing the Quidditch match, another click of a button, and the movie will resume instantly.
With photoreal graphics and production technology in which computer-rendered video is indistinguishable from live action characters, when a user makes a transition from a Quidditch game in a live action movie to a Quidditch game in a video game on a hosting service as described herein, the two scenes are virtually indistinguishable. This provides entirely new creative options for directors of both linear content and interactive (e.g., video game) content, as the lines between the two worlds become indistinguishable.
Utilizing the hosting service architecture shown in FIG. 14, the control of the virtual camera in a 3D movie can be offered to the viewer. For example, in a scene that takes place within a train car, it would be possible to allow the viewer to control the virtual camera and look around the car while the story progresses. This assumes that all of the 3D objects ("assets") in the car are available, as well as an adequate level of computing power capable of rendering the scenes in real-time, as well as the original movie.
And even for non-computer generated entertainment, there are very exciting interactive features that can be offered. For example, the 2005 motion picture "Pride and Prejudice" had many scenes in ornate old English mansions. For certain mansion scenes, the user may pause the video and then control the camera to take a tour of the mansion, or perhaps the surrounding area. To implement this, a camera with a fish-eye lens could be carried through the mansion, keeping track of its position, much like the prior art Apple, Inc. QuickTime VR is implemented. The various frames would then be transformed so the images are not distorted, and then stored on RAID array 1511-1512 along with the movie, and played back when the user chooses to go on a virtual tour.
With sports events, a live sports event, such as a basketball game, may be streamed through the hosting service 210 for users to watch, as they would for regular TV. After users watch a particular play, a video game of the game (eventually with basketball players looking as photoreal as the real players) could come up with the players starting in the same position, and the users (perhaps each taking control of one player) could redo the play to see if they could do better than the players.
The hosting service 210 described herein is extremely well-suited to support this futuristic world because it is able to bring to bear computing power and mass storage resources that are impractical to install in a home or in most office settings, and also because its computing resources are always up-to-date, with the latest computing hardware available, whereas in a home setting there will always be homes with older generation PCs and video games. And, in the hosting service 210, all of this computing complexity is hidden from the user, so even though they may be using very sophisticated systems, from the user's point of view it is as simple as changing channels on a television. Further, the users would be able to access all of the computing power and the experiences the computing power would bring from any client 415.
Multiplayer Games
To the extent a game is a multiplayer game, then it will be able to communicate both to app/game servers 1521-1525 through the inbound routing 1502 network and, with a network bridge to the Internet (not shown), with servers or game machines that are not running in the hosting service 210. When playing multiplayer games with computers on the general Internet, the app/game servers 1521-1525 will have the benefit of extremely fast access to the Internet (compared to if the game was running on a server at home), but they will be limited by the capabilities of the other computers playing the game on slower connections, and also potentially limited by the fact that the game servers on the Internet were designed to accommodate the least common denominator, which would be home computers on relatively slow consumer Internet connections.
But when a multiplayer game is played entirely within a hosting service 210 server center, then a world of difference is achievable. Each app/game server 1521-1525 hosting a game for a user will be interconnected with the other app/game servers 1521-1525, as well as with any servers that are hosting the central control for the multiplayer game, with extremely high speed, extremely low latency connectivity and vast, very fast storage arrays. For example, if Gigabit Ethernet is used for the inbound routing 1502 network, then the app/game servers 1521-1525 will be communicating among each other and communicating to any servers hosting the central control for the multiplayer game at gigabit/second speed with potentially only 1 ms of latency or less. Further, the RAID arrays 1511-1512 will be able to respond very rapidly and then transfer data at gigabit/second speeds. As an example, if a user customizes a character in terms of look and accoutrements such that the character has a large amount of geometry and behaviors that are unique to the character, with prior art systems limited to the game client running in the home on a PC or game console, if that character were to come into view of another user, that user would have to wait until a long, slow download completes so that all of the geometry and behavior data loads into their computer. Within the hosting service 210, that same download could be over Gigabit Ethernet, served from a RAID array 1511-1512 at gigabit/second speed. Even if the home user had an 8 Mbps Internet connection (which is extremely fast by today's standards), Gigabit Ethernet is 100 times faster. So, what would take a minute over a fast Internet connection would take less than a second over Gigabit Ethernet.
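As a quick check of that comparison (the asset size here is simply whatever downloads in about a minute at 8 Mbps, implied by the example above):

    asset_bits = 60 * 8e6             # downloads in ~1 minute at 8 Mbps
    home_s = asset_bits / 8e6         # 8 Mbps consumer connection -> 60 s
    hosted_s = asset_bits / 1e9       # Gigabit Ethernet -> 0.48 s
    print(home_s, hosted_s, home_s / hosted_s)   # 60.0  0.48  125x

The exact ratio is 125x, consistent with the roughly 100-times figure cited above, and the minute-long download indeed drops to under a second.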
Top Player Groupings and Tournaments
The hosting service 210 is extremely well-suited for tournaments. Because no game is running in a local client, there is no opportunity for users to cheat (e.g., as they might have in a prior art tournament by modifying the copy of the game running on their local PC to give them an unfair advantage). Also, because of the ability of the outbound routing 1540 to multicast the UDP streams, the hosting service 210 is able to broadcast the major tournaments to thousands or more people in the audience at once.
In fact, when there are certain video streams that are so popular that thousands of users are receiving the same stream (e.g., showing views of a major tournament), it may be more efficient to send the video stream to a Content Delivery Network (CDN), such as Akamai or Limelight, for mass distribution to many client devices 415.
A similar level of efficiency can be gained when a CDN is used to show Game Finder pages of top player groupings.
For major tournaments, a live celebrity announcer can be used to provide commentary during certain matches. Although a large number of users will be watching a major tournament, a relatively small number will be playing in it. The audio from the celebrity announcer can be routed to the app/game servers 1521-1525 hosting the users playing in the tournament and hosting any spectator-mode copies of the game in the tournament, and the audio can be overdubbed on top of the game audio. Video of the celebrity announcer can be overlaid on the games as well, perhaps just on spectator views.
Acceleration of Web Page Loading
The World Wide Web and its primary transport protocol, Hypertext Transfer Protocol (HTTP), were conceived and defined in an era where only businesses had high speed Internet connections, and the consumers who were online were using dialup modems or ISDN. At the time, the “gold standard” for a fast connection was a T1 line which provided 1.5 Mbps data rate symmetrically (i.e., with equal data rate in both directions).
Today, the situation is completely different. The average home connection speed through DSL or cable modem connections in much of the developed world has a far higher downstream data rate than a T1 line. In fact, in some parts of the world, fiber-to-the-curb is bringing data rates as high as 50 to 100 Mbps to the home.
Unfortunately, HTTP was not architected (nor has it been implemented) to effectively take advantage of these dramatic speed improvements. A web site is a collection of files on a remote server. In very simple terms, HTTP requests the first file, waits for the file to be downloaded, then requests the second file, waits for that file to be downloaded, etc. In fact, HTTP allows for more than one "open connection", i.e., more than one file to be requested at a time, but because of agreed-upon standards (and a desire to prevent web servers from being overloaded) only very few open connections are permitted. Moreover, because of the way Web pages are constructed, browsers often are not aware of multiple files that could be available to download immediately (i.e., only after parsing a page does it become apparent that a new file, like an image, needs to be downloaded). Thus, the files on a website are essentially loaded one-by-one. And, because of the request-and-response protocol used by HTTP, there is roughly (accessing typical web servers in the US) a 100 ms latency associated with each file that is loaded.
With relatively low speed connections, this does not introduce much of a problem because the download time for the files themselves dominates the waiting time for the web pages. But, as connection speeds grow, especially with complex web pages, problems begin to arise.
In the example shown inFIG. 24, a typical commercial website is shown (this particular website was from a major athletic shoe brand). The website has 54 files on it. The files include HTML, CSS, JPEG, PHP, JavaScript and Flash files, and include video content. A total of 1.5 MBytes must be loaded before the page is live (i.e., the user can click on it and begin to use it). There are a number of reasons for the large number of files. For one thing, it is a complex and sophisticated webpage, and for another, it is a webpage that is assembled dynamically based on the information about the user accessing the page (e.g., what country the user is from, what language, whether the user has made purchases before, etc.), and depending on all of these factors, different files are downloaded. Still, it is a very typical commercial web page.
FIG. 24shows the amount of time that elapses before the web page is live as the connection speed grows. With a 1.5 Mbps connection speed2401, using a conventional web server with a conventional web browser, it takes 13.5 seconds until the web page is live. With a 12 Mbps connection speed2402, the load time is reduced to 6.5 seconds, or about twice as fast. But with a 96 Mbps connection speed2403, the load time is only reduced to about 5.5 seconds. The reason is that at such a high download speed, the time to download the files themselves is minimal, but the latency per file, roughly 100 ms each, still remains, resulting in 54 files*100 ms=5.4 seconds of latency. Thus, no matter how fast the connection is to the home, this web site will always take at least 5.4 seconds until it is live. Another factor is server-side queuing; every HTTP request is added to the back of the queue, so on a busy server this has a significant impact because for every small item to be fetched from the web server, the HTTP request needs to wait its turn.
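For illustration only, the following minimal sketch (not part of the patent text) models the page-live time as per-file request latency plus raw transfer time; it ignores parallel connections and server-side queuing, and all names are hypothetical. It approximately reproduces the times cited above for the 54-file, 1.5 MByte page:

```python
def page_live_time_s(num_files, total_bytes, link_bps, latency_per_file_s=0.100):
    """Time until the page is live: per-file latency + raw transfer time."""
    transfer_s = total_bytes * 8 / link_bps
    return num_files * latency_per_file_s + transfer_s

for mbps in (1.5, 12, 96):
    t = page_live_time_s(num_files=54, total_bytes=1_500_000, link_bps=mbps * 1e6)
    print(f"{mbps:>5} Mbps -> ~{t:.1f} s until the page is live")
```

At 96 Mbps the transfer term nearly vanishes, leaving the 5.4 seconds of per-file latency as the dominant component, which is the point made above.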
One way to solve these issues is to discard or redefine HTTP. Or, perhaps, to get the website owner to better consolidate its files into a single file (e.g., in Adobe Flash format). But, as a practical matter, this company, as well as many others, has a great deal of investment in its web site architecture. Further, while some homes have 12-100 Mbps connections, the majority of homes still have slower speeds, and HTTP does work well at slow speeds.
One alternative is to host web browsers on app/game servers1521-1525, and host the files for the web servers on the RAID arrays1511-1512(or potentially in RAM or on local storage on the app/game servers1521-1525hosting the web browsers). Because of the very fast interconnect through the inbound routing1502(or to local storage), rather than incurring 100 ms of latency per file using HTTP, there will be de minimis latency per file using HTTP. Then, instead of having the user in her home accessing the web page through HTTP, the user can access the web page through client415. Then, even with a 1.5 Mbps connection (because this web page does not require much bandwidth for its video), the webpage will be live in less than 1 second, per line2400. Essentially, there will be no latency before the web browser running on an app/game server1521-1525is displaying a live page, and there will be no detectable latency before the client415displays the video output from the web browser. As the user mouses around and/or types on the web page, the user's input information will be sent to the web browser running on the app/game server1521-1525, and the web browser will respond accordingly.
One disadvantage to this approach is that if the compressor is constantly transmitting video data, then bandwidth is used even if the web page becomes static. This can be remedied by configuring the compressor to only transmit data when (and if) the web page changes, and then only transmit data for the parts of the page that change. While there are some web pages with flashing banners, etc. that are constantly changing, such web pages tend to be annoying, and usually web pages are static unless there is a reason for something to be moving (e.g., a video clip). For such web pages, it is likely the case that less data will be transmitted using the hosting service210than with a conventional web server, because only the actual displayed images will be transmitted: no thin client executable code, and no large objects that may never be viewed, such as rollover images.
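As a rough sketch of such change-only transmission (an illustrative reading of the above, not the patent's implementation; the tile size and pixel layout are assumptions), the compressor could compare the current browser output to the previous frame tile by tile and re-encode only the tiles that changed:

```python
TILE = 64  # tile edge in pixels (assumed value)

def changed_tiles(prev, curr, width, height):
    """Yield (x, y) for each TILE x TILE region whose pixels differ.

    prev and curr are flat, row-major sequences of pixel values."""
    for ty in range(0, height, TILE):
        for tx in range(0, width, TILE):
            for row in range(ty, min(ty + TILE, height)):
                start = row * width + tx
                end = min(start + TILE, row * width + width)
                if prev[start:end] != curr[start:end]:
                    yield tx, ty  # only this region is re-encoded and sent
                    break
```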
Thus, using the hosting service210to host legacy web pages, web page load times can be reduced to the point where opening a web page is like changing channels on a television: the web page is live effectively instantly.
Facilitating Debugging of Games and Applications
As mentioned previously, video games and applications with real-time graphics are very complex applications and typically when they are released into the field they contain bugs. Although software developers will get feedback from users about bugs, and they may have some means to pass back machine state after crashes, it is very difficult to identify exactly what has caused a game or real-time application to crash or to perform improperly.
When a game or application runs in the hosting service210, the video/audio output of the game or application is constantly recorded in a delay buffer1515. Further, a watchdog process runs on each app/game server1521-1525and reports regularly to the hosting service control system401that the app/game server1521-1525is running smoothly. If the watchdog process fails to report in, then the server control system401will attempt to communicate with the app/game server1521-1525, and if successful, will collect whatever machine state is available. Whatever information is available, along with the video/audio recorded by the delay buffer1515, will be sent to the software developer.
Thus, when the game or application software developer gets notification of a crash from the hosting service210, it gets a frame-by-frame record of what led up to the crash. This information can be immensely valuable in tracking down bugs and fixing them.
Note also that when an app/game server1521-1525crashes, the server is restarted at the most recent restartable point, and a message is provided to the user apologizing for the technical difficulty.
Resource Sharing and Cost Savings
The system shown inFIGS. 4aand 4bprovides a variety of benefits for both end users and game and application developers. For example, typically, home and office client systems (e.g., PCs or game consoles) are only in use for a small percentage of the hours in a week. According to an Oct. 5, 2006 press release by the Nielsen Entertainment "Active Gamer Benchmark Study" (http://www.prnewswire.com/cgi-bin/stories.pl?ACCT=104&STORY=/www/story/10-05-2006/0004446115&EDATE=), active gamers spend on average 14 hours a week playing on video game consoles and about 17 hours a week on handhelds. The report also states that for all game playing activity (including console, handheld and PC game playing) active gamers average 13 hours a week. Taking the higher of these figures, 17 hours a week, and given that there are 24*7=168 hours in a week, that implies that in an active gamer's home, a video game console is in use only 17/168=10% of the hours of the week. Or, 90% of the time, the video game console is idle. Given the high cost of video game consoles, and the fact that manufacturers subsidize such devices, this is a very inefficient use of an expensive resource. PCs within businesses are also typically used only a fraction of the hours of the week, especially non-portable desktop PCs often required for high-end applications such as Autodesk Maya. Although some businesses operate at all hours and on holidays, and some PCs (e.g., portables brought home for doing work in the evening) are used at all hours and on holidays, most business activity tends to center around 9 AM to 5 PM, in a given business's time zone, from Monday to Friday, less holidays and break times (such as lunch), and since most PC usage occurs while the user is actively engaged with the PC, it follows that desktop PC utilization tends to follow these hours of operation. If we were to assume that PCs are utilized constantly from 9 AM to 5 PM, 5 days a week, that would imply PCs are utilized 40/168=24% of the hours of the week. High-performance desktop PCs are very expensive investments for businesses, and this reflects a very low level of utilization. Schools that teach on desktop computers may use computers for an even smaller fraction of the week, and although it varies depending upon the hours of teaching, most teaching occurs during the daytime hours from Monday through Friday. So, in general, PCs and video game consoles are utilized only a small fraction of the hours of the week.
Notably, because many people are working at businesses or at school during the daytime hours of Monday through Friday on non-holidays, these people generally are not playing video games during these hours, and so when they do play video games it is generally during other hours, such as evenings, weekends and on holidays.
Given the configuration of the hosting service shown inFIG. 4a, the usage patterns described in the above two paragraphs result in very efficient utilization of resources. Clearly, there is a limit to the number of users who can be served by the hosting service210at a given time, particularly if the users are requiring real-time responsiveness for complex applications like sophisticated 3D video games. But, unlike a video game console in a home or a PC used by a business, which typically sits idle most of the time, servers402can be re-utilized by different users at different times. For example, a high-performance server402with high-performance dual CPUs and dual GPUs and a large quantity of RAM can be utilized by businesses and schools from 9 AM to 5 PM on non-holidays, but be utilized by gamers playing a sophisticated video game in the evenings, on weekends and on holidays. Similarly, low-performance applications can be utilized by businesses and schools on a low-performance server402with a Celeron CPU, no GPU (or a very low-end GPU) and limited RAM during business hours, and a low-performance game can utilize a low-performance server402during non-business hours.
Further, with the hosting service arrangement described herein, resources are shared efficiently among thousands, if not millions, of users. In general, online services only have a small percentage of their total user base using the service at a given time. If we consider the Nielsen video game usage statistics listed previously, it is easy to see why. If active gamers play console games only 17 hours a week, and if we assume that the peak usage times for games are during the typical non-work, non-business hours of evenings (5 PM-12 AM, 7 hours*5 days=35 hours/week) and weekends (8 AM-12 AM, 16 hours*2 days=32 hours/week), then there are 35+32=67 peak hours a week for 17 hours of game play. The exact peak user load on the system is difficult to estimate for many reasons: some users will play during off-peak times, there may be certain times of day when there are clustering peaks of users, the peak times can be affected by the type of game played (e.g., children's games will likely be played earlier in the evening), etc. But, given that the average number of hours played by a gamer is far less than the number of hours of the day when a gamer is likely to play a game, only a fraction of the number of users of the hosting service210will be using it at a given time. For the sake of this analysis, we shall assume the peak load is 12.5%. Thus, only 12.5% of the computing, compression and bandwidth resources are used at a given time, resulting in only 12.5% of the hardware cost per user to support a given level of game performance, due to the reuse of resources.
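The arithmetic above can be checked directly; the following small sketch is illustrative only, with the 12.5% figure kept as the stated assumption rather than a derived quantity:

```python
HOURS_PER_WEEK = 24 * 7                                   # 168 hours
print(f"console utilization: {17 / HOURS_PER_WEEK:.0%}")  # ~10%
print(f"business PC utilization: {40 / HOURS_PER_WEEK:.0%}")  # ~24%
peak_hours = 7 * 5 + 16 * 2   # weekday evenings + weekends = 67 hours
print(f"peak gaming hours per week: {peak_hours}")
# The 12.5% peak load is an assumption layered on top of these figures,
# since each gamer's 17 weekly hours are spread unevenly across the
# 67 peak hours shared by all users.
```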
Moreover, given that some games and applications require more computing power than others, resources may be allocated dynamically based on the game being played or the application executed by the user. So, a user selecting a low-performance game or application will be allocated a low-performance (less expensive) server402, and a user selecting a high-performance game or application will be allocated a high-performance (more expensive) server402. Indeed, a given game or application may have lower-performance and higher-performance sections, and the user can be switched from one server402to another server402between sections to keep the user running on the lowest-cost server402that meets the game or application's needs. Note that the RAID arrays405, which will be far faster than a single disk, will be available even to low-performance servers402, which will have the benefit of the faster disk transfer rates. So, the average cost per server402across all of the games being played or applications being used is much less than the cost of the most expensive server402that plays the highest-performance game or application, yet even the low-performance servers402will derive disk performance benefits from the RAID arrays405.
Further, a server402in the hosting service210may be nothing more than a PC motherboard without a disk or peripheral interfaces other than a network interface, and in time, may be integrated down to a single chip with just a fast network interface to the SAN403. Also, RAID arrays405likely will be shared amongst far more users than there are disks, so the disk cost per active user will be far less than one disk drive. All of this equipment will likely reside in a rack in an environmentally-controlled server room environment. If a server402fails, it can be readily repaired or replaced at the hosting service210. In contrast, a PC or game console in the home or office must be a sturdy, standalone appliance that has to be able to survive reasonable wear and tear from being banged or dropped, requires a housing, has at least one disk drive, has to survive adverse environmental conditions (e.g., being crammed into an overheated AV cabinet with other gear), requires a service warranty, has to be packaged and shipped, and is sold by a retailer who will likely collect a retail margin. Further, a PC or game console must be configured to meet the peak performance of the most computationally-intensive anticipated game or application to be used at some point in the future, even though lower-performance games or applications (or sections of games or applications) may be played most of the time. And, if the PC or console fails, it is an expensive and time-consuming process (adversely impacting the manufacturer, user and software developer) to get it repaired.
Thus, given that the system shown inFIG. 4aprovides an experience to the user comparable to that of a local computing resource, it is much less expensive to provide a given level of computing capability to a user in the home, office or school through the architecture shown inFIG. 4a.
Eliminating the Need to Upgrade
Further, users no longer have to worry about upgrading PCs and/or consoles to play new games or handle higher-performance new applications. Any game or application on the hosting service210, regardless of what type of server402is required for that game or application, is available to the user, and all games and applications run nearly instantly (e.g., loading rapidly from the RAID arrays405or local storage on a server402) and properly with the latest updates and bug fixes (i.e., software developers will be able to choose an ideal server configuration for the server(s)402that run(s) a given game or application, and then configure the server(s)402with optimal drivers, and then over time, the developers will be able to provide updates, bug fixes, etc. to all copies of the game or application in the hosting service210at once). Indeed, after the user starts using the hosting service210, the user is likely to find that games and applications continue to provide a better experience (e.g., through updates and/or bug fixes), and it may be the case that a user discovers a year later that a new game or application is made available on the service210that utilizes computing technology (e.g., a higher-performance GPU) that did not even exist a year before, so it would have been impossible for the user to buy, a year earlier, technology that would play the game or run the application a year later. Since the computing resource that is playing the game or running the application is invisible to the user (i.e., from the user's perspective the user is simply selecting a game or application that begins running nearly instantly, much as if the user had changed channels on a television), the user's hardware will have been "upgraded" without the user even being aware of the upgrade.
Eliminating the Need for Backups
Another major problem for users in businesses, schools and homes is backups. Information stored on a local PC or video game console (e.g., in the case of a console, a user's game achievements and ranking) can be lost if a disk fails, or if there is an inadvertent erasure. There are many applications available that provide manual or automatic backups for PCs, and game console state can be uploaded to an online server for backup, but local backups are typically copied to another local disk (or other non-volatile storage device) which has to be stored somewhere safe and organized, and backups to online services are often limited because of the slow upstream speed available through typical low-cost Internet connections. With the hosting service210ofFIG. 4a, the data that is stored in RAID arrays405can be configured using prior art RAID configuration techniques well-known to those skilled in the art such that if a disk fails, no data will be lost; rather, a technician at the server center housing the failed disk will be notified, will replace the disk, and the RAID array will then be automatically updated so that it is once again failure-tolerant. Further, since all of the disk drives are near one another, with fast local networks between them through the SAN403, it is not difficult in a server center to arrange for all of the disk systems to be backed up on a regular basis to secondary storage, which can be either stored at the server center or relocated offsite. From the point of view of the users of the hosting service210, their data is simply secure all the time, and they never have to think about backups.
Access to Demos
Users frequently want to try out games or applications before buying them. As described previously, there are prior art means by which to demo games and applications (the verb "demo" means to try out a demonstration version, which is itself also called a "demo", as a noun), but each of them suffers from limitations and/or inconveniences. Using the hosting service210, it is easy and convenient for users to try out demos. Indeed, all the user does is select the demo through a user interface (such as one described below) and try it out. The demo will load almost instantly onto a server402appropriate for the demo, and it will just run like any other game or application. Whether the demo requires a very high-performance server402or a low-performance server402, and no matter what type of home or office client415the user is using, from the point of view of the user, the demo will just work. The software publisher of either the game or application demo will be able to control exactly what demo the user is permitted to try out and for how long, and of course, the demo can include user interface elements that offer the user an opportunity to gain access to a full version of the game or application demonstrated.
Since demos are likely to be offered below cost or free of charge, some users may try to use demos repeatedly (particularly game demos, which may be fun to play repeatedly). The hosting service210can employ various techniques to limit demo use for a given user. The most straightforward approach is to establish a user ID for each user and limit the number of times a given user ID is allowed to play a demo. A user, however, may set up multiple user IDs, especially if they are free. One technique for addressing this problem is to limit the number of times a given client415is allowed to play a demo. If the client is a standalone device, then the device will have a serial number, and the hosting service210can limit the number of times a demo can be accessed by a client with that serial number. If the client415is running as software on a PC or other device, then a serial number can be assigned by the hosting service210and stored on the PC and used to limit demo usage, but given that PCs can be reprogrammed by users, and the serial number erased or changed, another option is for the hosting service210to keep a record of the PC network adapter Media Access Control (MAC) address (and/or other machine-specific identifiers such as hard-drive serial numbers, etc.) and limit demo usage to it. Given that the MAC addresses of network adapters can be changed, however, this is not a foolproof method. Another approach is to limit the number of times a demo can be played from a given IP address. Although IP addresses may be periodically reassigned by cable modem and DSL providers, this does not happen very frequently in practice, and if it can be determined (e.g., by contacting the ISP) that the IP is in a block of IP addresses for residential DSL or cable modem access, then a small number of demo uses can typically be established for a given home. Also, there may be multiple devices in a home behind a NAT router sharing the same IP address, but typically in a residential setting there will be a limited number of such devices. If the IP address is in a block serving businesses, then a larger number of demos can be established for a business. But, in the end, a combination of all of the previously mentioned approaches is the best way to limit the number of demos on PCs. Although there may be no foolproof way to limit the number of demos a determined and technically adept user can play repeatedly, creating a large number of barriers can create a sufficient deterrent such that it is not worth the trouble for most PC users to abuse the demo system, and instead they will use demos as they were intended: to try out new games and applications.
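A minimal sketch of such layered limiting follows; the identifier kinds mirror those described above, while the specific limits and data structures are purely illustrative assumptions:

```python
DEMO_LIMITS = {
    "user_id": 3,          # plays allowed per user ID
    "serial": 3,           # per client serial number
    "mac": 5,              # per network adapter MAC address
    "ip_residential": 5,   # per residential IP address
    "ip_business": 25,     # per business IP address (more users behind NAT)
}

def may_play_demo(play_counts, user_id, serial, mac, ip, ip_is_business=False):
    """play_counts maps (identifier_kind, value) -> plays recorded so far.
    Deny the demo if any identifier layer has reached its limit."""
    ip_kind = "ip_business" if ip_is_business else "ip_residential"
    checks = [("user_id", user_id), ("serial", serial),
              ("mac", mac), (ip_kind, ip)]
    return all(play_counts.get((kind, value), 0) < DEMO_LIMITS[kind]
               for kind, value in checks)
```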
Benefits to Schools, Businesses and Other Institutions
Significant benefits accrue particularly to businesses, schools and other institutions that utilize the system shown inFIG. 4a. Businesses and schools have substantial costs associated with installing, maintaining and upgrading PCs, particularly when it comes to PCs for running high-performance applications, such as Maya. As stated previously, PCs are generally utilized only a fraction of the hours of the week, and as in the home, the cost of a PC with a given level of performance capability is far higher in an office or school environment than in a server center environment.
In the case of larger businesses or schools (e.g., large universities), it may be practical for the IT departments of such entities to set up server centers and maintain computers that are remotely accessed via LAN-grade connections. A number of solutions exist for remote access of computers over a LAN or through a private high-bandwidth connection between offices. For example, with Microsoft's Windows Terminal Server, or through virtual network computing applications like VNC, from RealVNC, Ltd., or through thin client means from Sun Microsystems, users can gain remote access to PCs or servers, with a range of quality in graphics response time and user experience. Further, such self-managed server centers are typically dedicated to a single business or school and as such are unable to take advantage of the overlap of usage that is possible when disparate applications (e.g., entertainment and business applications) utilize the same computing resources at different times of the week. Moreover, many businesses and schools lack the scale, resources or expertise to set up a server center on their own that has a LAN-speed network connection to each user. Indeed, a large percentage of schools and businesses have the same Internet connections (e.g., DSL, cable modems) as homes.
Yet such organizations may still have the need for very high-performance computing, either on a regular basis or on a periodic basis. For example, a small architectural firm may have only a small number of architects, with relatively modest computing needs when doing design work, but it may require very high-performance 3D computing periodically (e.g., when creating a 3D fly-through of a new architectural design for a client). The system shown inFIG. 4ais extremely well suited for such organizations. The organizations need nothing more than the same sort of network connections that are offered to homes (e.g., DSL, cable modems), which are typically very inexpensive. They can either utilize inexpensive PCs as the client415or dispense with PCs altogether and utilize inexpensive dedicated devices which simply implement the control signal logic413and low-latency video decompression412. These features are particularly attractive for schools that may have problems with theft of PCs or damage to the delicate components within PCs.
Such an arrangement solves a number of problems for such organizations (and many of these advantages are also shared by home users doing general-purpose computing). For one, the operating cost (which ultimately must be passed back in some form to the users in order to have a viable business) can be much lower because (a) the computing resources are shared with other applications that have different peak usage times during the week, (b) the organizations can gain access to (and incur the cost of) high performance computing resources only when needed, (c) the organizations do not have to provide resources for backing up or otherwise maintaining the high performance computing resources.
Elimination of Piracy
In addition, games, applications, interactive movies, etc., can no longer be pirated as they are today. Because each game is stored and executed at the hosting service210, users are not provided with access to the underlying program code, so there is nothing to pirate. Even if a user were able to copy the code, the user would not be able to execute it on a standard game console or home computer. This opens up markets in places in the world, such as China, where standard video gaming is not made available. The re-sale of used games is also not possible because no copies of games are distributed to users.
For game developers, there are fewer market discontinuities than is the case today when new generations of game consoles or PCs are introduced to the market. The hosting service210can be gradually updated with more advanced computing technology over time as gaming requirements change, in contrast to the current situation, where a completely new generation of console or PC technology forces users and developers to upgrade, and the game developer is dependent on the timely delivery of the hardware platform to the user (e.g., in the case of the PlayStation 3, its introduction was delayed by more than a year, and developers had to wait until it was available and significant numbers of units were purchased).
Streaming Interactive Video
The above descriptions provide a wide range of applications enabled by the novel underlying concept of general Internet-based, low-latency streaming interactive video (which, as used herein, implicitly includes audio together with the video). Prior art systems that have provided streaming video through the Internet have only enabled applications which can be implemented with high-latency interactions. For example, basic playback controls for linear video (e.g., pause, rewind, fast forward) work adequately with high latency, and it is possible to select among linear video feeds. And, as stated previously, the nature of some video games allows them to be played with high latency. But the high latency (or low compression ratio) of prior art approaches for streaming video has severely limited the potential applications of streaming video or narrowed their deployments to specialized network environments, and even in such environments, prior art techniques introduce substantial burdens on the networks. The technology described herein opens the door for the wide range of applications possible with low-latency streaming interactive video through the Internet, particularly those enabled through consumer-grade Internet connections.
Indeed, with client devices as small as client465ofFIG. 4csufficient to provide an enhanced user experience, backed by an effectively arbitrary amount of computing power, an arbitrary amount of fast storage, and extremely fast networking amongst powerful servers, a new era of computing is enabled. Further, because the bandwidth requirements do not grow as the computing power of the system grows (i.e., the bandwidth requirements are only tied to display resolution, quality and frame rate), once broadband Internet connectivity is ubiquitous (e.g., through widespread low-latency wireless coverage), reliable, and of sufficiently high bandwidth to meet the needs of the display devices422of all users, the question will be whether thick clients (such as PCs or mobile phones running Windows, Linux, OSX, etc.) or even thin clients (such as Adobe Flash or Java) are necessary for typical consumer and business applications.
The advent of streaming interactive video results in a rethinking of assumptions about the structure of computing architectures. An example of this is the hosting service210server center embodiment shown inFIG. 15. The video path for delay buffer and/or group video1550is a feedback loop where the multicasted streaming interactive video output of the app/game servers1521-1525is fed back into the app/game servers1521-1525either in real-time via path1552or after a selectable delay via path1551. This enables a wide range of practical applications (e.g. such as those illustrated inFIGS. 16, 17 and 20) that would be either impossible or infeasible through prior art server or local computing architectures. But, as a more general architectural feature, what feedback loop1550provides is recursion at the streaming interactive video level, since video can be looped back indefinitely as the application requires it. This enables a wide range of application possibilities never available before.
Another key architectural feature is that the video streams are unidirectional UDP streams. This enables effectively an arbitrary degree of multicasting of streaming interactive video (in contrast, two-way streams, such as TCP/IP streams, would create increasingly more traffic logjams on the networks from the back-and-forth communications as the number of users increased). Multicasting is an important capability within the server center because it allows the system to be responsive to the growing needs of Internet users (and indeed of the world's population) to communicate on a one-to-many, or even a many-to-many, basis. Again, the examples discussed herein, such asFIG. 16, which illustrates the use of both streaming interactive video recursion and multicasting, are just the tip of a very large iceberg of possibilities.
Non-Transit Peering
In one embodiment, the hosting service210has one or more peering connections to one or more Internet Service Providers (ISPs) who also provide Internet service to users, and in this way the hosting service210may be able to communicate with the user through a non-transit route that stays within that ISP's network. For example, if the hosting service210WAN interface441were directly connected to Comcast Cable Communications, Inc.'s network, and the user premises211were provisioned with broadband service through a Comcast cable modem, a route between the hosting service210and the client415could be established entirely within Comcast's network. The potential advantages of this include lower cost for the communications (since the IP transit costs between two or more ISP networks might be avoided), a potentially more reliable connection (in case there were congestion or other transit disruptions between ISP networks), and lower latency (in case there were congestion, inefficient routes or other delays between ISP networks).
In this embodiment, when the client415initially contacts the hosting service210at the beginning of a session, the hosting service210receives the IP address of the user premises211. It then uses available IP address tables, e.g., from ARIN (American Registry for Internet Numbers), to see if the IP address is one allocated to a particular ISP connected to the hosting service210that can route to the user premises211without IP transit through another ISP. For example, if the IP address were between 76.21.0.0 and 76.21.127.255, then the IP address is assigned to Comcast Cable Communications, Inc. In this example, if the hosting service210maintains connections to the Comcast, AT&T and Cox ISPs, then it selects Comcast as the ISP most likely to provide an optimal route to that particular user.
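For illustration, a sketch of this lookup using Python's standard ipaddress module; the CIDR block shown corresponds to the 76.21.0.0-76.21.127.255 range of the example, and a real deployment would consult current ARIN allocation data rather than a hard-coded table:

```python
import ipaddress

# Illustrative peering table: ISP name -> address blocks allocated to it.
PEERED_ISPS = {
    "Comcast": [ipaddress.ip_network("76.21.0.0/17")],  # 76.21.0.0-76.21.127.255
    # "AT&T": [...], "Cox": [...]  # further peering connections would go here
}

def select_isp(client_ip):
    """Return a peered ISP that can reach this address without IP transit,
    or None if the session must fall back to ordinary transit routing."""
    addr = ipaddress.ip_address(client_ip)
    for isp, blocks in PEERED_ISPS.items():
        if any(addr in block for block in blocks):
            return isp
    return None

print(select_isp("76.21.40.1"))  # -> Comcast
```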
Video Compression Using Feedback
In one embodiment, feedback is provided from the client device to the hosting service to indicate successful (or unsuccessful) tile and/or frame delivery. The feedback information provided from the client is then used to adjust the video compression operations at the hosting service.
For example,FIGS. 25a-billustrate one embodiment of the invention in which a feedback channel2501is established between the client device205and the hosting service210. The feedback channel2501is used by the client device205to send packetized acknowledgements of successfully received tiles/frames and/or indications of unsuccessfully received tiles/frames.
In one embodiment, after successfully receiving each tile/frame, the client transmits an acknowledgement message to the hosting service210. In this embodiment, the hosting service210detects a packet loss if it does not receive an acknowledgement after a specified period of time and/or if it receives an acknowledgement that the client device205has received a tile/frame subsequent to one that has not yet been acknowledged. Alternatively, or in addition, the client device205may detect the packet loss and transmit an indication of the packet loss to the hosting service210along with an indication of the tiles/frames affected by the packet loss. In this embodiment, continuous acknowledgement of successfully delivered tiles/frames is not required.
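A minimal sketch of the hosting-service side of this loss detection follows (illustrative only; the timeout value and data structures are assumptions, not taken from the patent):

```python
import time

ACK_TIMEOUT_S = 0.25  # assumed; would be tuned to the channel's round trip

class AckTracker:
    """Track sent tiles/frames and infer losses from client ACKs."""

    def __init__(self):
        self.sent = {}           # sequence number -> send timestamp
        self.highest_acked = -1  # highest sequence acknowledged so far

    def on_send(self, seq):
        self.sent[seq] = time.monotonic()

    def on_ack(self, seq):
        self.sent.pop(seq, None)
        self.highest_acked = max(self.highest_acked, seq)

    def lost(self):
        """Presume lost: anything unacknowledged past the timeout, or
        anything older than a later tile/frame that was acknowledged."""
        now = time.monotonic()
        return [s for s, t in self.sent.items()
                if now - t > ACK_TIMEOUT_S or s < self.highest_acked]
```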
Regardless of how a packet loss is detected, in the embodiment illustrated inFIGS. 25a-b, after generating an initial set of I-tiles for an image (not shown inFIG. 25a), the encoder subsequently generates only P-tiles until a packet loss is detected. Note that inFIG. 25a, each frame, such as2510, is illustrated as 4 vertical tiles. The frame may be tiled in a different configuration, such as 2×2, 2×4, 4×4, etc., or the frame may be encoded in its entirety with no tiles (i.e., as 1 large tile). The foregoing examples of frame tiling configurations are provided for the purpose of illustrating this embodiment of the invention. The underlying principles of the invention are not limited to any particular frame tiling configuration.
Transmitting only P-tiles reduces the bandwidth requirements of the channel for all of the reasons set forth above (i.e., P-tiles are generally smaller than I-tiles). When a packet loss is detected via the feedback channel2501, new I-tiles are generated by the encoder2500, as illustrated inFIG. 25b, to re-initialize the state of the decoder2502on the client device205. As illustrated, in one embodiment, the I-tiles are spread across multiple encoded frames to limit the bandwidth consumed by each individual encoded frame. For example, inFIG. 25b, in which each frame includes 4 tiles, a single I-tile is transmitted at a different position within each of 4 successive encoded frames.
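The following sketch illustrates this spreading of recovery I-tiles across successive frames for the 4-tile case; the function name and recovery-scheduling details are illustrative assumptions:

```python
TILES_PER_FRAME = 4  # matches the 4-tile frames of FIGS. 25a-b

def tile_types(frame_index, recovery_start):
    """Return the per-tile encode type ('I' or 'P') for one frame.

    recovery_start is the index of the first frame encoded after the
    loss was reported, or None if no recovery is in progress."""
    in_recovery = (recovery_start is not None and
                   0 <= frame_index - recovery_start < TILES_PER_FRAME)
    if not in_recovery:
        return ["P"] * TILES_PER_FRAME
    pos = frame_index - recovery_start  # rotate the I-tile position
    return ["I" if i == pos else "P" for i in range(TILES_PER_FRAME)]

# Frames 10-13 each carry one I-tile, in a different position each frame:
for f in range(10, 14):
    print(f, tile_types(f, recovery_start=10))
```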
The encoder2500may combine the techniques described with respect to this embodiment with other encoding techniques described herein. For example, in addition to generating I-tiles in response to a detected packet loss, the encoder2500may generate I-tiles in other circumstances in which I-tiles may be beneficial to properly render the sequence of images (such as in response to sudden scene transitions).
FIG. 26aillustrates another embodiment of the invention which relies on a feedback channel2601between the client device205and the hosting service210. Rather than generating new I-tiles/frames in response to a detected packet loss, the encoder2600of this embodiment adjusts the dependencies of the P-tiles/frames. As an initial matter, it should be noted that the specific details set forth in this example are not required for complying with the underlying principles of the invention. For example, while this example will be described using P-tiles/frames, the underlying principles of the invention are not limited to any particular encoding format.
InFIG. 26a, the encoder2600encodes a plurality of uncompressed tiles/frames2605into a plurality of P-tiles/frames2606and transmits the P-tiles/frames over a communication channel (e.g., the Internet) to a client device205. A decoder2602on the client device205decodes the P-tiles/frames2606to generate a plurality of decompressed tiles/frames2607. The past state(s)2611of the encoder2600are stored within a memory device2610on the hosting service210and the past state(s)2621of the decoder2602are stored within a memory device2620on the client device205. The "state" of a decoder is a well-known term of art in video coding systems such as MPEG-2 and MPEG-4. In one embodiment, the past "state" stored within the memories comprises the combined data from prior P-tiles/frames. The memory devices2610and2620may be integrated within the encoder2600and decoder2602, respectively, rather than being detached from the encoder2600and decoder2602as shown inFIG. 26a. Moreover, various types of memory may be used including, by way of example and not limitation, random access memory.
In one embodiment, when no packet loss occurs, the encoder2600encodes each P-tile/frame to be dependent on the previous P-tile/frame. Thus, as indicated by the notation used inFIG. 26a, P-tile/frame4is dependent on P-tile/frame3(identified using the notation 4₃); P-tile/frame5is dependent on P-tile/frame4(identified using the notation 5₄); and P-tile/frame6is dependent on P-tile/frame5(identified using the notation 6₅). In this example, P-tile/frame 4₃ has been lost during transmission between the encoder2600and the decoder2602. The loss may be communicated to the encoder2600in various ways including, but not limited to, those described above. For example, each time the decoder2602successfully receives and/or decodes a tile/frame, this information may be communicated from the decoder2602to the encoder2600. If the encoder2600does not receive an indication that a particular tile/frame has been received and/or decoded after a period of time, then the encoder2600will assume that the tile/frame has not been successfully received. Alternatively, or in addition, the decoder2602may notify the encoder2600when a particular tile/frame is not successfully received.
In one embodiment, regardless of how the lost tile/frame is detected, once it is, the encoder2600encodes the next tile/frame using the last tile/frame known to have been successfully received by the decoder2602. In the example shown inFIG. 26a, tiles/frames5and6are not considered “successfully received” because they cannot be properly decoded by the decoder2602due to the loss of tile/frame4(i.e., the decoding of tile/frame5depends on tile/frame4and the decoding of tile/frame6depends on tile/frame5). Thus, in the example shown inFIG. 26a, the encoder2600encodes tile/frame7to be dependent on tile/frame3(the last successfully received tile/frame) rather than tile/frame6which the decoder2602cannot properly decode. Although not illustrated inFIG. 26a, tile/frame8will subsequently be encoded to be dependent on tile/frame7and tile/frame9will be encoded to be dependent on tile/frame8, assuming that no additional packet losses are detected.
As mentioned above, both the encoder2600and the decoder2602maintain past encoder and decoder states,2611and2621, within memories2610and2620, respectively. Thus, when encoding tile/frame7, the encoder2600retrieves the prior encoder state associated with tile/frame3from memory2610. Similarly, the memory2620associated with decoder2602stores at least the last known good decoder state (the state associated with P-tile/frame3in the example). Consequently, the decoder2602retrieves the past state information associated with tile/frame3so that tile/frame7can be decoded.
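A toy sketch of this re-anchoring behavior follows. Here the "state" is modeled as the reference frame itself and a P-tile/frame as a per-pixel delta, which is a drastic simplification of a real codec; the class and method names are illustrative assumptions:

```python
class ReanchoringEncoder:
    """Encode P-frames, re-anchoring to the last ACKed frame on a loss."""

    def __init__(self, first_seq, first_frame):
        self.states = {first_seq: first_frame}  # seq -> saved "state"
        self.ref_seq = first_seq                # next P-frame's reference

    def encode_p(self, seq, frame, last_acked_seq=None):
        if last_acked_seq is not None:          # loss reported: re-anchor
            self.ref_seq = last_acked_seq
        ref = self.states[self.ref_seq]
        delta = [c - r for c, r in zip(frame, ref)]  # P-data vs. reference
        self.states[seq] = frame    # save state for possible re-anchoring
        self.ref_seq = seq
        return delta

enc = ReanchoringEncoder(first_seq=3, first_frame=[10, 10, 10])
enc.encode_p(4, [11, 10, 9])    # depends on 3 (this one is lost in transit)
enc.encode_p(5, [12, 11, 9])    # depends on 4; undecodable at the client
enc.encode_p(6, [12, 11, 10])   # depends on 5; undecodable at the client
print(enc.encode_p(7, [12, 12, 10], last_acked_seq=3))  # re-anchored to 3
```

The final call mirrors FIG. 26a: with tile/frame 4 lost, tile/frame 7 is encoded against the saved state of tile/frame 3 rather than against tile/frame 6.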
As a result of the techniques described above, real-time, low latency, interactive video can be encoded and streamed using relatively small bandwidth because no I-tiles/frames are ever required (except to initialize the decoder and encoder at the start of the stream). Moreover, while the video image produced by the decoder may temporarily include undesirable distortion resulting from lost tile/frame4and tiles/frames5and6(which cannot be properly decoded due to the loss of tile/frame4), this distortion will be visible for a very short duration. Moreover, if tiles are used (rather than full video frames), the distortion will be limited to a particular region of the rendered video image.
A method according to one embodiment of the invention is illustrated inFIG. 26b. At2650, a tile/frame is generated based on a previously-generated tile/frame. At2651, a lost tile/frame is detected. In one embodiment, the lost tile/frame is detected based on information communicated from the decoder to the encoder, as described above. At2652, the next tile/frame is generated based on a tile/frame which is known to have been successfully received and/or decoded at the decoder. In one embodiment, the encoder generates the next tile/frame by loading the state associated with the successfully received and/or decoded tile/frame from memory. Similarly, when the decoder receives the new tile/frame, it decodes the tile/frame by loading the state associated with the successfully received and/or decoded tile/frame from memory.
In one embodiment, the next tile/frame is generated based upon the last tile/frame successfully received and/or decoded at the decoder. In another embodiment, the next tile/frame generated is an I-tile/frame. In yet another embodiment, the choice of whether to generate the next tile/frame based on a previously successfully received tile/frame, or as an I-tile/frame, is based on how many tiles/frames were lost and/or the latency of the channel. In a situation where a relatively small number (e.g., 1 or 2) of tiles/frames are lost and the round-trip latency is relatively low (e.g., 1 or 2 frame times), then it may be optimal to generate a P-tile/frame, since the difference between the last successfully received tile/frame and the newly generated one may be relatively small. If several tiles/frames are lost or the round-trip latency is high, then it may be optimal to generate an I-tile/frame, since the difference between the last successfully received tile/frame and the newly generated one may be large. In one embodiment, a tile/frame loss threshold and/or a latency threshold value is set to determine whether to transmit an I-tile/frame or a P-tile/frame. If the number of lost tiles/frames is below the tile/frame loss threshold and the round-trip latency is below the latency threshold value, then a new P-tile/frame is generated; otherwise, a new I-tile/frame is generated.
In one embodiment, the encoder always attempts to generate a P-tile/frame relative to the last successfully received tile/frame, and if in the encoding process the encoder determines that the P-tile/frame will likely be larger than an I-tile/frame (e.g., if it has compressed 1/8th of the tile/frame and the compressed size is larger than 1/8th of the size of the average I-tile/frame previously compressed), then the encoder will abandon compressing the P-tile/frame and will instead compress an I-tile/frame.
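Both heuristics can be sketched compactly; the threshold values below are illustrative assumptions rather than values given in the text:

```python
LOSS_THRESHOLD = 2            # tiles/frames (assumed)
LATENCY_THRESHOLD_S = 0.033   # ~2 frame times at 60 fps (assumed)

def choose_tile_type(lost_count, round_trip_s):
    """Small divergence -> P relative to the last good tile/frame;
    large divergence -> re-initialize with an I."""
    if lost_count <= LOSS_THRESHOLD and round_trip_s <= LATENCY_THRESHOLD_S:
        return "P"
    return "I"

def abandon_p_encode(bytes_so_far, fraction_done, avg_i_size):
    """True if the partial P encode is already trending larger than an
    average I tile/frame (e.g., 1/8 done but past 1/8 of the I size)."""
    return bytes_so_far > fraction_done * avg_i_size
```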
If lost packets occur infrequently, the system described above, using feedback to report a dropped tile/frame, typically results in a very slight disruption in the video stream to the user, because a tile/frame that was disrupted by a lost packet is replaced in roughly the time of one round trip between the client device205and the hosting service210, assuming the encoder2600compresses the tile/frame in a short amount of time. And, because the new tile/frame that is compressed is based upon a later frame in the uncompressed video stream, the video stream does not fall behind the uncompressed video stream. But, if a packet containing the new tile/frame also is lost, then this results in a delay of at least two round trips to yet again request and send another new tile/frame, which in many practical situations will result in a noticeable disruption to the video stream. As a consequence, it is very important that the newly-encoded tile/frame sent after a dropped tile/frame is successfully sent from the hosting service210to the client device205.
In one embodiment, forward error correction (FEC) coding techniques, such as those previously described and illustrated inFIGS. 11a, 11b, 11cand 11d, are used to reduce the probability of losing the newly-encoded tile/frame. If FEC coding is already being used when transmitting tiles/frames, then a stronger FEC code is used for the newly-encoded tile/frame.
One potential cause of dropped packets is a sudden loss in channel bandwidth, for example, if some other user of the broadband connection at the user premises211starts using a large amount of bandwidth. If a newly-generated tile/frame also is lost due to dropped packets (even if FEC is used), then in one embodiment, when the hosting service210is notified by the client415that a second newly-encoded tile/frame has been dropped, the video compressor404reduces the data rate when it encodes a subsequent newly-encoded tile/frame. Different embodiments reduce the data rate using different techniques. For example, in one embodiment, this data rate reduction is accomplished by lowering the quality of the encoded tile/frame by increasing the compression ratio. In another embodiment, the data rate is reduced by lowering the frame rate of the video (e.g., from 60 fps to 30 fps) and accordingly slowing the rate of data transmission. In one embodiment, both techniques for reducing the data rate are used (e.g., both reducing the frame rate and increasing the compression ratio). If this lower rate of data transmission is successful at mitigating the dropped packets, then, in accordance with the channel data rate detection and adjustment methods previously described, the hosting service210will continue encoding at the lower data rate, and then gradually adjust the data rate upward or downward as the channel allows. The continuous receipt of feedback data related to dropped packets and/or latency allows the hosting service210to dynamically adjust the data rate based on current channel conditions.
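A minimal sketch of this back-off (illustrative only; the scaling factors and the quality floor are assumptions):

```python
class RateController:
    """Back off on repeated recovery-frame loss; creep back as channel allows."""

    def __init__(self):
        self.quality = 1.0  # relative quality (the compression-ratio knob)
        self.fps = 60

    def on_recovery_frame_lost_again(self):
        """Second newly-encoded tile/frame dropped: cut the data rate."""
        self.quality = max(0.25, self.quality * 0.5)  # compress harder
        if self.fps == 60:
            self.fps = 30                             # halve the frame rate

    def on_channel_recovered(self):
        """Gradually adjust upward once dropped packets subside."""
        self.quality = min(1.0, self.quality * 1.1)
        if self.quality > 0.9:
            self.fps = 60
```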
State Management in an Online Gaming System
One embodiment of the invention employs techniques to efficiently store and port the current state of an active game between servers. While the embodiments described herein are related to online gaming, the underlying principles of the invention may be used for various other types of applications (e.g., design applications, word processors, communication software such as email or instant messaging, etc.).FIG. 27aillustrates an example system architecture for implementing this embodiment andFIG. 27billustrates an example method. While the method and system architecture will be described concurrently, the method illustrated inFIG. 27bis not limited to any particular system architecture.
At2751ofFIG. 27b, a user initiates a new online game on a hosting service210afrom a client device205. In response, at2752, a “clean” image of the game2702ais loaded from storage (e.g., a hard drive, whether connected directly to a server executing the game, or connected to a server through a network) to memory (e.g., RAM) on the hosting service210a. The “clean” image comprises the runtime program code and data for the game prior to the initiation of any game play (e.g., as when the game is executed for the first time). The user then plays the game at2753, causing the “clean” image to change to a non-clean image (e.g., an executing game represented by “State A” inFIG. 27a). At2754, the game is paused or terminated, either by the user or the hosting service210a. At2755, state management logic2700aon the hosting service210adetermines the differences between the “clean” image of the game and the current game state (“State A”). Various known techniques may be used to calculate the difference between two binary images including, for example, those used in the well known “diff” utility available on the UNIX operating system. Of course, the underlying principles of the invention are not limited to any particular techniques for difference calculation.
Regardless of how the differences are calculated, once they are, the difference data is stored locally within a storage device2705aand/or transmitted to a different hosting service210b. If transmitted to a different hosting service210b, the difference data may be stored on a storage device (not shown) at the new hosting service210b. In either case, the difference data is associated with the user's account on the hosting services so that it may be identified the next time the user logs in to the hosting services and initiates the game. In one embodiment, rather than being transmitted immediately, the difference data is not transmitted to a new hosting service until the next time the user attempts to play the game (and a different hosting service is identified as the best choice for hosting the game).
Returning to the method shown inFIG. 27b, at2757, the user reinitiates the game from a client device, which may be the same client device205from which the user initially played the game or a different client device (not shown). In response, at2758, state management logic2700bon the hosting service210bretrieves the “clean” image of the game from a storage device and the difference data. At2759, the state management logic2700bcombines the clean image and difference data to reconstruct the state that the game was in on the original hosting service210a(“State A”). Various known techniques may be used to recreate the state of a binary image using the difference data including, for example, those used in the well known “patch” utility available on the UNIX operating system. The difference calculation techniques used in well known backup programs such as PC Backup may also be used. The underlying principles of the invention are not limited to any particular techniques for using difference data to recreate a binary image.
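For illustration, a naive byte-level version of the diff/patch scheme follows; a real system would use a binary-diff tool in the spirit of the UNIX utilities mentioned above, and equal-length images are assumed for simplicity:

```python
def diff(clean: bytes, state: bytes):
    """Record (offset, new_byte) for every byte that differs.
    Assumes the clean image and the game state are the same length."""
    return [(i, b) for i, (a, b) in enumerate(zip(clean, state)) if a != b]

def patch(clean: bytes, delta):
    """Reapply the recorded differences to a copy of the clean image."""
    out = bytearray(clean)
    for i, b in delta:
        out[i] = b
    return bytes(out)

clean   = b"clean game image....."
state_a = b"clean game image..A.."        # image after play ("State A")
delta = diff(clean, state_a)              # small; cheap to store or transmit
assert patch(clean, delta) == state_a     # reconstructed at the new hosting service
```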
In addition, at2760, platform-dependent data2710is incorporated into the final game image2701b. The platform-dependent data2710may include any data which is unique to the destination server platform. By way of example, and not limitation, the platform-dependent data2710may include the Media Access Control (MAC) address of the new platform, the TCP/IP address, the time of day, hardware serial numbers (e.g., for the hard drive and CPU), network server addresses (e.g., DHCP/Wins servers), and software serial number(s)/activation code(s) (including Operating System serial number(s)/activation code(s)).
Other platform-dependent data related to the client/user may include (but is not limited to) the following:
1. The user's screen resolution. When the user resumes the game, the user may be using a different device with a different resolution.
2. The user's controller configuration. When the game resumes, the user may have switched from a game controller to a keyboard/mouse.
3. User entitlements, such as whether a discount rate has expired (e.g., if the user was playing the game during a promotional period and is now playing during a normal period at higher cost), or whether the user or device has certain age restrictions (e.g., the parents of the user may have changed the settings for a child so the child is not allowed to see mature material, or the device playing the game (e.g., a computer at a public library) may have certain restrictions on whether mature material can be displayed).
4. The user's ranking. The user may have been allowed to play a multiplayer game in a certain league, but because some other users had exceeded the user's ranking, the user may have been downgraded to a lesser league.
The foregoing examples of platform-dependent data2710are provided for the purpose of illustration of this embodiment of the invention. The underlying principles of the invention are not limited to any particular set of platform-dependent data.
FIG. 28graphically illustrates how the state management logic2700aat the first hosting service extracts difference data2800from the executing game2701a. The state management logic2700bat the second hosting service then combines the clean image2702bwith the difference data2800and platform-dependent data2710to regenerate the state of the executing game2701b. As shown generally inFIG. 28, the size of the difference data is significantly smaller than the size of the entire game image2701aand, consequently, a significant amount of storage space and bandwidth is conserved by storing/transmitting only the difference data. Although not shown inFIG. 28, the platform-dependent data2710may overwrite some of the difference data when it is incorporated into the final game image2701b.
While an online video gaming implementation is described above, the underlying principles of the invention are not limited to video games. For example, the foregoing state management techniques may be implemented within the context of any type of online-hosted application.
Techniques for Maintaining a Client Decoder
In one embodiment of the invention, the hosting service210transmits a new decoder to the client device205each time the user requests to connect to the hosting service210. Consequently, in this embodiment, the decoder used by the client device is always up-to-date and uniquely tailored to the hardware/software implemented on the client device.
As illustrated inFIG. 29, in this embodiment, the application which is permanently installed on the client device205does not include a decoder. Rather, it is a client downloader application2903which manages the download and installation of a temporary decoder2900each time the client device205connects to the hosting service210. The downloader application2903may be implemented in hardware, software, firmware, or any combination thereof. In response to a user request for a new online session, the downloader application2903transmits information related to the client device205over a network (e.g., the Internet). The information may include identification data identifying the client device and/or the client device's hardware/software configuration (e.g., processor, operating system, etc.).
Based on this information, a downloader application2901on the hosting service210selects an appropriate temporary decoder2900to be used on the client device205. The downloader application2901on the hosting service then transmits the temporary decoder2900and the downloader application2903on the client device verifies and/or installs the decoder on the client device205. The encoder2902then encodes the audio/video content using any of the techniques described herein and transmits the content2910to the decoder2900. Once the new decoder2900is installed, it decodes the content for the current online session (i.e., using one or more of the audio/video decompression techniques described herein). In one embodiment, when the session is terminated, the decoder2900is removed (e.g., uninstalled) from the client device205.
In one embodiment, the downloader application2903characterizes the channel while the temporary decoder2900is being downloaded, making channel assessments such as the data rate achievable on the channel (e.g., by determining how long it takes for data to download), the packet loss rate on the channel, and the latency of the channel. The downloader application2903generates channel characterization data describing these channel assessments. This channel characterization data is then transmitted from the client device205to the hosting service downloader2901, which uses the channel characterization data to determine how best to utilize the channel to transmit media to the client device205.
The client device205will typically send back messages to the hosting service210during the downloading of the temporary decoder2900. These messages can include acknowledgement messages indicating whether packets were received without errors or with errors. In addition, the messages provide feedback to the downloader2901as to the data rate (calculated based on the rate at which packets are received), the packet error rate (based on the percentage of packets reported received with errors), and the round-trip latency of the channel (based on the amount of time that it takes before the downloader2901receives feedback about a given packet that has been transmitted).
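By way of illustration only, the following Python sketch shows one way such acknowledgement messages could be reduced to the three channel assessments named above. The AckRecord structure and its field names are assumptions made for the sketch, not elements of the disclosed system.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class AckRecord:
    bytes_received: int   # payload size of the acknowledged packet
    had_error: bool       # True if the packet arrived corrupted
    rtt_ms: float         # time from transmission until its ack arrived

def characterize_channel(acks: List[AckRecord], elapsed_s: float) -> dict:
    """Summarize data rate, packet error rate, and round-trip latency
    over a measurement window of elapsed_s seconds."""
    if not acks or elapsed_s <= 0:
        return {"data_rate_bps": 0.0, "packet_error_rate": 0.0, "round_trip_ms": 0.0}
    total_bytes = sum(a.bytes_received for a in acks)
    errors = sum(1 for a in acks if a.had_error)
    return {
        "data_rate_bps": total_bytes * 8 / elapsed_s,
        "packet_error_rate": errors / len(acks),
        "round_trip_ms": sum(a.rtt_ms for a in acks) / len(acks),
    }
```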
By way of example, if the data rate is determined to be 2 Mbps, then the downloader may choose a smaller video window resolution for the encoder2902(e.g. 640×480 at 60 fps) than if the data rate is determined to be 5 Mbps (e.g. 1280×720 at 60 fps). Different forward error correction (FEC) or packet structures may be chosen, depending on the packet loss rate.
If the packet loss is very low, then the compressed audio and video may be transmitted without any error correction. If the packet loss is medium, then the compressed audio and video may be transmitted with error correction coding techniques (e.g., such as those previously described and illustrated inFIGS. 11a, 11b, 11cand 11d). If the packet loss is very high, it may be determined that an audiovisual stream of adequate quality cannot be transmitted, and the client device205may either notify the user that the hosting service is not available through the communications channel (i.e. the “link”), or it may try to establish a different route to the hosting service that has a lower packet loss (as described below).
If the latency is low, then the compressed audio and video can be transmitted with low latency and a session can be established. If the latency is too high (e.g., higher than 80 ms) then, for games which require low latency, the client device205may notify the user that the hosting service is not available through the link, notify the user that a link is available but the response time to user input will be sluggish or "laggy," or suggest that the user try to establish a different route to the hosting service that has a lower latency (as described below).
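The decision logic of the preceding three paragraphs can be summarized in a short sketch. The thresholds used below for the "very low," "medium," and "very high" packet loss bands are illustrative assumptions; only the 2 Mbps/5 Mbps resolution examples and the 80 ms latency figure come from the text.

```python
def select_session_config(data_rate_bps, packet_loss, latency_ms):
    """Map measured channel characteristics to a session decision."""
    if latency_ms > 80:            # too laggy for low-latency games
        return {"available": False, "reason": "latency too high; try another route"}
    if packet_loss > 0.10:         # "very high" loss band (assumed threshold)
        return {"available": False, "reason": "packet loss too high; try another route"}
    config = {"available": True, "fps": 60}
    # 5 Mbps supports 1280x720 at 60 fps; around 2 Mbps, fall back to 640x480.
    config["resolution"] = (1280, 720) if data_rate_bps >= 5_000_000 else (640, 480)
    # Very low loss: no error correction; medium loss: apply FEC.
    config["fec"] = "none" if packet_loss < 0.01 else "forward_error_correction"
    return config

print(select_session_config(2_000_000, 0.005, 40))
# {'available': True, 'fps': 60, 'resolution': (640, 480), 'fec': 'none'}
```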
The Client Device205may try to connect to the Hosting Service210through another route through the network (e.g. the Internet) to see if impairments are reduced (e.g. the packet loss is lower, the latency is lower, or even if the data rate is higher). For example, the Hosting Service210may connect to the Internet from multiple locations geographically (e.g., a hosting center in Los Angeles and one in Denver), and perhaps there is high packet loss due to congestion in Los Angeles, but there is not congestion in Denver. Also, the Hosting Service210may connect to the Internet through multiple Internet service providers (e.g. AT&T and Comcast).
Because of congestion or other issues between the client device205and one of the service providers (e.g. AT&T), packet loss and/or high latency and/or constrained data rate may result. However, if the Client Device205connects to the hosting service210through another service provider (e.g., Comcast), it may be able to connect without congestion problems and/or lower packet loss and/or lower latency and/or higher data rate. Thus, if the client device205experiences packet loss above a specified threshold (e.g., a specified number of dropped packets over a specified duration), latency above a specified threshold and/or a data rate below a specified threshold while downloading the temporary decoder2900, in one embodiment, it attempts to reconnect to the hosting service210through an alternate route (typically by connecting to a different IP address or different domain name) to determine if a better connection can be obtained.
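A hedged sketch of this reconnection behavior follows; the endpoint list and the probe() helper are hypothetical stand-ins for connecting to the alternate IP addresses or domain names mentioned above.

```python
def best_route(endpoints, probe, loss_limit=0.05, latency_limit_ms=80):
    """Probe each hosting-service entry point (a different IP address or
    domain name) and return the first acceptable route; failing that,
    return the least-impaired route found, or None."""
    best = None
    for ep in endpoints:
        m = probe(ep)   # -> {"loss": ..., "latency_ms": ..., "rate_bps": ...}
        if m["loss"] <= loss_limit and m["latency_ms"] <= latency_limit_ms:
            return ep                       # good enough; stop searching
        if best is None or m["latency_ms"] < best[1]["latency_ms"]:
            best = (ep, m)
    return best[0] if best else None
```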
If the connection is still experiencing unacceptable impairments after alternative connection options are exhausted, then it could be that the client device205's local connection to the Internet is suffering from impairments, or that it is too far away from the hosting service210to achieve adequate latency. In such a case, the client device205may notify the user that the Hosting Service is not available through the link, that it is only available with impairments, and/or that only certain types of low-latency games/applications are available.
Since this assessment and potential improvement of the link characteristics between the Hosting Service210and the Client Device205occurs while the temporary decoder is being downloaded, it reduces the amount of time that the client device205would need to spend separately downloading the temporary decoder2900and assessing the link characteristics. Nonetheless, in another embodiment, the assessment and potential improvement of the link characteristics is performed by the client device205separately from downloading the temporary decoder2900(e.g., by using dummy test data rather than the decoder program code). There are a number of reasons why this may be a preferable implementation. For example, in some embodiments, the client device205is implemented partially or entirely in hardware. Thus, for these embodiments, there is no software decoder per se necessary to download.
Compression Using Standards-Based Tile Sizes
As mentioned above, when tile-based compression is used, the underlying principles of the invention are not limited to any particular tile size, shape, or orientation. For example, in a DCT-based compression system such as MPEG-2 and MPEG-4, tiles may be the size of macroblocks (components used in video compression which typically represent a block of 16 by 16 pixels). This embodiment provides a very fine level of granularity for working with tiles.
Moreover, regardless of tile size, various types of tiling patterns may be used. For example,FIG. 30illustrates an embodiment in which multiple I-tiles are used in each R frame3001-3004. A rotating pattern is used in which I-tiles are dispersed throughout each R frame so that a full I-frame is generated every four R frames. Dispersing the I-tiles in this manner will reduce the effects of a packet loss (limiting the loss to a small region of the display).
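The rotating pattern can be made concrete with a small sketch. The four-frame cycle matches FIG. 30 as described; dispersing the I-tiles by tile row is one assumed pattern among many possible ones.

```python
def tile_types(frame_number: int, tile_rows: int, cycle: int = 4) -> list:
    """Return the tile type ('I' or 'P') for each tile row of an R frame,
    rotating which rows carry I-tiles so that a full I-frame's worth of
    I-tiles is generated every `cycle` frames."""
    phase = frame_number % cycle
    return ["I" if row % cycle == phase else "P" for row in range(tile_rows)]

# Over 4 consecutive R frames, every row is refreshed by an I-tile exactly once:
for f in range(4):
    print(f, tile_types(f, tile_rows=8))
```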
The tiles may also be sized to an integral native structure of the underlying compression algorithm. For example, if the H.264 compression algorithm is used, in one embodiment, tiles are set to be the size of H.264 "slices." This allows the techniques described herein to be easily integrated into the context of various different standard compression algorithms such as H.264 and MPEG-4. Once the tile size is set to a native compression structure, the same techniques as those described above may be implemented.
Techniques for Stream Rewind and Playback Operations
As previously described in connection withFIG. 15, the uncompressed video/audio stream1529generated by an app/game server1521-1525may be compressed by shared hardware compression1530at multiple resolutions simultaneously, resulting in multiple compressed video/audio streams1539. For example, a video/audio stream generated by app/game server1521may be compressed at 1280×720×60 fps by the shared hardware compression1530and transmitted to a user via outbound routing1540as outbound Internet traffic1599. That same video/audio stream may be simultaneously scaled down to thumbnail size (e.g., 200×113) by the shared hardware compression1530via path1552(or through delay buffer1515) to app/game server1522to be displayed as one thumbnail1600of a collection of thumbnails inFIG. 16. When thumbnail1600is zoomed through intermediate size1700inFIG. 17to size1800(1280×720×60 fps) inFIG. 18, then rather than decompressing the thumbnail stream, app/game server1522can decompress a copy of the 1280×720×60 fps stream being sent to the user of app/game server1521, and scale the higher resolution video as it is zoomed from thumbnail size to 1280×720 size. This approach has the advantage of reutilizing the 1280×720 compressed stream twice. But it has several disadvantages: (a) the compressed video stream sent to the user may vary in image quality if the data throughput of the user's Internet connection varies, resulting in a varying image quality viewed by the "spectating" user of app/game server1522, even if that user's Internet connection does not vary; (b) app/game server1522will have to use processing resources to decompress the entire 1280×720 image and then scale that image (and likely apply a re-sampling filter) to display much smaller sizes (e.g., 640×360) during the zoom; (c) if frames are dropped due to limited Internet connection bandwidth and/or lost/corrupted packets, and the spectating user "rewinds" and "pauses" the video recorded in the delay buffer1515, the spectating user will find the dropped frames are missing from the delay buffer (this will be particularly apparent if the user "steps" frame-by-frame); and (d) if the spectating user rewinds to find a particular frame in the video recorded in the delay buffer, then the app/game server1522will have to find an I frame or I tiles prior to the sought frame in the video stream recorded in the delay buffer, and then decompress all of the P frames/tiles until the desired frame is reached. These same limitations would apply not only to users "spectating" the video/audio stream live, but also to users (including the user that generated the video/audio stream) viewing an archived (e.g., "Brag Clip") copy of the video/audio stream.
An alternative embodiment of the invention addresses these issues by compressing the video stream in more than one size and/or structure. One stream (the “Live” stream) is compressed optimally to stream to the end user, as described herein, based on the characteristics of the network connection (e.g. data bandwidth, packet reliability) and the user's local client capabilities (e.g., decompression capability, display resolution). Other streams (referred to herein as “HQ” streams) are compressed at high quality, at one or more resolutions, and in a structure amenable to video playback, and such HQ streams are routed and stored within the server center210. For example, in one embodiment, the HQ compressed streams are stored on a RAID disk array1515and are used to provide functions such as pause, rewind, and other playback functions (e.g., “Brag Clips” which may be distributed to other users for viewing).
As illustrated inFIG. 31a, one embodiment of the invention comprises an encoder3100capable of compressing a video stream in at least two formats: one which periodically includes I-Tiles or I-Frames3110and one which does not include I-Tiles or I-Frames3111, unless necessary due to a disruption of the stream or because an I-Tile or I-Frame is determined to likely be smaller than a P-Tile or P-Frame (as described above). For example, the "Live" stream3111transmitted to the user while playing a video game may be compressed using only P-Frames (unless I-Tiles or I-Frames are necessary or smaller, as described above). In addition, the encoder3100of this embodiment concurrently compresses the video stream in a second format which, in one embodiment, periodically includes I-Tiles or I-Frames (or a similar type of image format).
While the embodiments described above employ I-Tiles, I-Frames, P-Tiles and P-Frames, the underlying principles of the invention are not limited to any particular compression algorithm. For example, any type of image format in which frames are dependent on previous or subsequent frames may be used in place of P-Tiles or P-Frames. Similarly, any type of image format which is not dependent on previous or subsequent frames may be substituted in place of the I-Tiles or I-Frames described above.
As mentioned above, the HQ Stream3110includes periodic I-Frames (e.g., in one embodiment, every 12 frames or so). This is significant because if the user ever wants to quickly rewind the stored video stream to a particular point, I-Tiles or I-Frames are required. With a compressed stream of only P-Frames (i.e., without the first frame of the sequence being an I-Frame), it would be necessary for the decoder to go back to the first frame of the sequence (which might be hours long) and decompress P-Frames up to the point to which the user wants to rewind. With an I-Frame every 12 frames stored in the HQ stream3110, the user can decide to rewind to a particular spot and the nearest preceding I-Frame of the HQ stream is no more than 12 frames prior to the desired frame. Even if the decoder's maximum decode rate is real-time (e.g., 1/60th of a second per frame for a 60 frame/sec stream), the rewind point is at most 12 (frames)/60 (frames/sec) = 1/5 second away from an I-Frame. And, in many cases, decoders can operate much faster than real-time, so, for example, at 2× real-time a decoder could decode 12 frames in the time of 6 frames, which is just a 1/10th of a second delay for a "rewind". Needless to say, even a fast decoder (e.g., 10× real-time) would have an unacceptable delay if the nearest preceding I-Frame were a large number of frames previous to a rewind point (e.g., with an hour of P-Frames, it would take 1 hour/10 = 6 minutes to do a "rewind"). In another embodiment, periodic I-Tiles are used, and in this case, when the user seeks to rewind, the decoder will find the nearest preceding I-Tile prior to the rewind point, and then commence decoding from that point until all tiles are decoded through to the rewind point. Although periodic I-Tiles or I-Frames result in less efficient compression than eliminating I-Frames entirely, the hosting service210typically has more than enough locally available bandwidth and storage capacity to manage the HQ stream.
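The rewind arithmetic above can be verified with a short sketch, assuming an I-frame every 12 frames and a 60 fps stream as in the example; the function names are illustrative.

```python
def frames_to_decode_for_rewind(target_frame: int, period: int = 12) -> int:
    """Frames the decoder must process: the nearest preceding I-frame
    plus the intervening P-frames up to the rewind target."""
    nearest_i = (target_frame // period) * period
    return target_frame - nearest_i + 1

def rewind_delay_seconds(target_frame, period=12, fps=60, decode_speed=1.0):
    return frames_to_decode_for_rewind(target_frame, period) / (fps * decode_speed)

# Worst case with 12-frame periodicity at 60 fps, matching the text:
print(rewind_delay_seconds(11, decode_speed=1.0))   # 0.2  (1/5 s at real-time)
print(rewind_delay_seconds(11, decode_speed=2.0))   # 0.1  (1/10 s at 2x real-time)
```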
In another embodiment, the encoder3100encodes the HQ stream with periodic I-Tiles or I-Frames, followed by P-Tiles or P-Frames, as previously described, but also preceded by B-Tiles or B-Frames. B-Frames, as described previously, are frames that precede an I-Frame and are based on frame differences from the I-Frame working backwards in time. B-Tiles are the tile counterpart, preceding an I-Tile and based on frame differences working backwards from the I-Tile. In this embodiment, if the desired rewind point is a B-Frame (or contains B-Tiles), then the decoder will find the nearest succeeding I-Frame or I-Tile and decode backwards in time until the desired rewind point is decoded, and then, as video playback proceeds from that point forward, the decoder will decode B-Frames, I-Frames and P-Frames (or their tile counterparts) in successive frames going forward. An advantage of employing B-Frames or B-Tiles in addition to the I and P types is that higher quality can often be achieved at a given compression ratio.
In yet another embodiment, the encoder3100encodes the HQ stream as all I-Frames. An advantage of this approach is that every rewind point is an I-Frame, and as a result, no other frames need to be decoded in order to reach the rewind point. A disadvantage is that the compressed data rate will be very high compared to I,P or I,P,B stream encoding.
Other video stream playback actions (e.g., fast or slow rewind, fast or slow forward, etc.) are typically accomplished much more practically with periodic I-Frames or I-Tiles (alone or combined with P and/or B counterparts), since in each case the stream is played back in a different frame order than frame-by-frame forward in time, and as a result, the decoder needs to find and decode a particular, often arbitrary, frame in the sequence. For example, in the case of very fast-forward (e.g., 100× speed), each successive frame displayed is 100 frames after the prior frame. Even with a decoder that runs at 10× real-time and decodes 10 frames in 1 frame time, it would still be 10× too slow to achieve 100× fast-forward. Whereas, with periodic I-Frames or I-Tiles as described above, the decoder is able to seek the nearest applicable I-Frame or I-Tile to the frame it needs to display next and only decode the intervening frames or tiles up to the target frame.
In another embodiment, I-Frames are encoded in the HQ stream at a consistent periodicity (e.g., always every 8 frames) and the speed multipliers made available to the user for fast-forward and rewind that are faster than the I-Frame periodicity are exact multiples of the I-Frame periodicity. For example, if the I-Frame periodicity is 8 frames, then the fast-forward or rewind speeds made available to the user might be 1×, 2×, 3×, 4×, 8×, 16×, 64×, 128× and 256×. For speeds faster than the I-Frame periodicity, the decoder will first jump ahead to the closest I-Frame that is the number of frames ahead at the chosen speed (e.g., if the currently displayed frame is 3 frames prior to an I-Frame, then at 128×, the decoder would jump to a frame 128+3 frames ahead), and then for each successive frame the decoder would jump the exact number of frames corresponding to the chosen speed (e.g., at the chosen speed of 128×, the decoder would jump 128 frames), which would land exactly on an I-Frame each time. Thus, given that all speeds faster than the I-Frame periodicity are exact multiples of the I-Frame periodicity, the decoder will never need to decode any preceding or following frames to seek the desired frame, and will only have to decode one I-Frame per displayed frame. For speeds slower than the I-Frame periodicity (e.g., 1×, 2×, 3×, 4×), and for faster speeds that are not exact multiples of the I-Frame periodicity, for each frame displayed the decoder seeks whichever starting frame requires the fewest additional newly decoded frames to display the desired frame, be it an undecoded I-Frame or an already-decoded frame still available in decoded form (in RAM or other fast storage), and then decodes intervening frames, as necessary, until the desired frame is decoded and displayed. For example, at 4× fast-forward, in an I,P encoded sequence with 8-frame I-Frame periodicity, if the current frame is a P-Frame that is 1 frame following an I-Frame, then the desired frame to be displayed is 4 frames later, which would be the 5th P-Frame following the preceding I-Frame. If the currently displayed frame (which had just been decoded) is used as a starting point, the decoder will need to decode 4 more P-Frames to display the desired frame. If the preceding I-Frame is used, the decoder will need to decode 6 frames (the I-Frame and the succeeding 5 P-Frames) in order to display the desired frame. (Clearly, in this case, it is advantageous to use the currently displayed frame to minimize the additional frames to decode.) Then, the next frame to be decoded, 4 frames ahead, would be the 1st P-Frame following an I-Frame. In this case, if the currently decoded frame were used as a starting point, the decoder would need to decode 4 more frames (2 P-Frames, an I-Frame and a P-Frame). But, if the next I-Frame were used instead, the decoder would only need to decode the I-Frame and the successive P-Frame. (Clearly, in this case, it is advantageous to use the next I-Frame as a starting point to minimize the additional frames to decode.) Thus, in this example, the decoder would alternate between using the currently decoded frame as a starting point and using a subsequent I-Frame as a starting point. As a general principle, regardless of the HQ video stream playback mode (fast-forward, rewind or step) and speed, the decoder starts with whichever frame, be it an I-Frame or a previously decoded frame, requires the least number of newly decoded frames to display the desired frame for each successive frame displayed for that playback mode and speed.
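The "least newly decoded frames" rule can be expressed compactly. The sketch below reproduces the 4× fast-forward example from the preceding paragraph; it is an illustration of the principle, not the patent's literal implementation.

```python
def decode_cost(target, current=None, period=8):
    """Newly decoded frames needed to display `target`, starting from
    either the nearest preceding I-frame or the already-decoded frame
    `current` (when the target lies ahead of it)."""
    preceding_i = (target // period) * period
    costs = [target - preceding_i + 1]        # I-frame plus following P-frames
    if current is not None and current <= target:
        costs.append(target - current)        # continue from the decoded frame
    return min(costs)

# 4x fast-forward, I,P stream, 8-frame I-Frame periodicity (the text's example):
print(decode_cost(target=5, current=1))   # 4: continue from the current frame
print(decode_cost(target=9, current=5))   # 2: restart at the I-frame at frame 8
```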
As illustrated inFIG. 31b, one embodiment of the hosting service210includes stream replay logic3112for managing user requests to replay the HQ stream3110. The stream replay logic3112receives client requests containing video playback commands (e.g., pause, rewind, playback from a specified point, etc.), interprets the commands, and decodes the HQ stream3110from the specified point (starting with either an I-Frame or a previously decoded frame, as appropriate, and then proceeding forward or backward to the specified point). In one embodiment, a decoded HQ stream is provided to an encoder3100(potentially the self-same encoder3100, if capable of encoding more than one stream at once, or a separate encoder3100) so that it may be recompressed (using the techniques described herein) and transmitted to the client device205. The decoder3102on the client device then decodes and renders the stream as described above.
In one embodiment, the stream replay logic3112does not decode the HQ stream and then cause the encoder3100to re-encode the stream. Rather, it simply streams the HQ stream3110directly to the client device205from the specified point. The decoder3102on the client device205then decodes the HQ stream. Because the playback functions described herein do not typically have the same low-latency requirements as playing a real-time video game (e.g., if the player is simply reviewing prior gameplay, not actively playing), the added latency typically inherent in the usually higher-quality HQ stream may still result in an acceptable end user experience (e.g., with higher latency but higher-quality video).
By way of example, and not limitation, if the user is playing a video game, the encoder3100provides a Live stream of essentially all P-frames optimized for the user's connection and local client (e.g., approximately 1.4 Mbps at a 640×360 resolution). At the same time, the encoder3100also compresses the video stream as an HQ stream3110within the hosting service210and stores the HQ stream on a local Digital Video Recorder (DVR) RAID array at, for example, 1280×720 at 10 Mbps with I-frames every 12 frames. If the user hits a "Pause" button, then the game will be paused on the client's last decoded frame and the screen will freeze. Then, if the user hits a "Rewind" button, the stream replay logic3112will read the HQ stream3110from the DVR RAID starting from the closest I-frame or available already-decoded frame, as described above. The stream replay logic3112will decompress the intervening P or B frames as necessary, re-sequence the frames as necessary so that the playback sequence runs backwards at the desired rewind speed, and then resize (using image scaling techniques well-known in the art) the decoded frames intended to be displayed from 1280×720 to 640×360, and the Live stream encoder3100will re-compress the re-sequenced stream at 640×360 resolution and transmit it to the user. If the user pauses again, and then single-steps through the video to watch a sequence closely, the HQ stream3110on the DVR RAID will have every frame available for single-stepping (even though the original Live stream may have dropped frames for any of the many reasons described herein). Further, the quality of the video playback will be quite high at every point in the HQ stream, whereas there may be points in the Live stream where, for example, the bandwidth had been impaired, resulting in a temporary reduction in compressed image quality. While impaired image quality for a brief period of time, or in a moving image, may be acceptable to the user, if the user stops at a particular frame (or single-steps slowly) and studies frames closely, impaired quality may not be acceptable. The user is also provided with the ability to fast-forward, or jump to a particular spot, by specifying a point within the HQ stream (e.g., 2 minutes prior). All of these operations would be impractical in their full generality and at high quality with a Live video stream that was P-frame-only or rarely (or unpredictably) had I-Frames.
In one embodiment, the user is provided with a video window (not shown) such as an Apple QuickTime or Adobe Flash video window with a "scrubber" (i.e., a left-right slider control) that allows the user to sweep forward and backward through the video stream, as far back as the HQ stream has stored the video. Although it appears to the user as if he or she is "scrubbing" through the Live stream, in fact he or she is scrubbing through the stored HQ stream3110, which is then resized and recompressed as a Live stream. In addition, as previously mentioned, if the HQ stream is watched by anyone else at the same time, or by the user at a different time, it can be watched at a higher (or lower) resolution than the Live stream's resolution, since the HQ stream is encoded simultaneously with the Live stream, and the quality will be as high as the quality of the viewer's Live stream allows, potentially up to the quality of the HQ stream.
Thus, by simultaneously encoding both the Live stream (as described herein, in a manner appropriate for its low-latency, bandwidth and packet error-tolerance requirements) and an HQ stream (meeting its requirements for high quality and flexible stream playback actions), the user is provided with the desired configuration for both scenarios. And, in fact, it is effectively transparent to the user that there are two different streams being encoded differently. From the user's perspective, the experience is highly responsive with low latency, despite running on a highly variable and relatively low-bandwidth Internet connection, yet the Digital Video Recording (DVR) functionality is very high quality, with flexible actions and flexible speeds.
As a result of the techniques described above, the user receives the benefits of both Live and HQ video streams during online game play, or other online interaction, without suffering from any of the limitations of either a Live stream or an HQ stream.
FIG. 31cillustrates one embodiment of a system architecture for performing the above operations. As illustrated, in this embodiment, the encoder3100encodes a series of "Live" streams3121L,3122L, and3125L and a corresponding series of "HQ" streams3121H1-H3,3122H1-H3, and3125H1-H3, respectively. Each HQ stream H1is encoded at full resolution, while encoders H2and H3scale the video stream to a smaller size prior to encoding. For example, if the video stream were 1280×720 resolution, H1would encode at 1280×720 resolution, while H2could scale to 640×360 and encode at that resolution, and H3could scale to 320×180 and encode at that resolution. Any number of simultaneous Hn scaler/encoders, where n is an integer greater than 1, could be used, providing multiple simultaneous HQ encodings at a variety of resolutions.
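A minimal sketch of the H1/H2/H3 arrangement follows, assuming each successive encoder halves the previous dimensions (consistent with the 1280×720/640×360/320×180 example; other scale factors would work equally well).

```python
def hq_resolutions(full_w: int, full_h: int, n: int) -> list:
    """Resolutions for n HQ scaler/encoders: H1 at full resolution,
    each subsequent Hn at half the previous width and height."""
    return [(full_w >> i, full_h >> i) for i in range(n)]

print(hq_resolutions(1280, 720, 3))   # [(1280, 720), (640, 360), (320, 180)]
```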
Each of the Live streams operates in response to channel feedback signals3161,3162, and3165received via an inbound Internet connection3101, as described above (see, e.g., the discussion of feedback signals2501and2601inFIGS. 25-26). The Live streams are transmitted out over the Internet (or other network) via outbound routing logic3140. The Live compressors3121L-3125L include logic for adapting the compressed video streams (including scaling, dropping frames, etc.) based on channel feedback.
The HQ streams are routed by inbound routing logic3141and1502to internal delay buffers (e.g., RAID array3115) or other data storage devices via signal path3151and/or are fed back via signal path3152into app/game servers and encoder3100for additional processing. As described above, the HQ streams3121Hn-3125Hn are subsequently streamed to end users upon request (see, e.g.,FIG. 31band associated text).
In one embodiment, the encoder3100is implemented with the shared hardware compression logic1530shown inFIG. 15. In another embodiment, some or all of the encoders and scalers are individual subsystems. The underlying principles of the invention are not limited to any particular sharing of scaling or compression resources or hardware/software configuration.
An advantage of the configuration ofFIG. 31cis that App/Game Servers3121-3125that require smaller than full-size video windows will not need to process and decompress a full-size window. Also, App/Game Servers3121-3125that require in-between window sizes can receive a compressed stream that is near the desired window size, and then scale it up or down to the desired window size. Also, if multiple App/Game Servers3121-3125request the same size video stream from another App/Game Server3121-3125, Inbound Routing3141can implement IP multicast techniques, such as those well-known in the art, and broadcast the requested stream to multiple App/Game Servers3121-3125at once, without requiring an independent stream to each App/Game Server making a request. If an App/Game Server receiving a broadcast changes the size of a video window, it can switch over to the broadcast of a different video size. Thus, an arbitrarily large number of users can simultaneously view an App/Game Server video stream, each with the flexibility of scaling their video windows and always getting the benefit of a video stream scaled closely to the desired window size.
One disadvantage with the approach shown inFIG. 31cis that in many practical implementations of the Hosting Service210, there is never a time when all of the compressed HQ streams, let alone all of the sizes of all of the compressed HQ streams, are viewed at once. When encoder3100is implemented as a shared resource (e.g. a scaler/compressor, either implemented in software or hardware), this wastefulness is mitigated. But, there may be practical issues in connecting a large number of uncompressed streams to a common shared resource, due to the bandwidth involved. For example, each 1080p60 stream is almost 3 Gbps, which is far in excess of even Gigabit Ethernet. The following alternate embodiments address this issue.
FIG. 31dshows an alternative embodiment of the Hosting Service210in which each App/Game Server3121-3125has two compressors allocated to it: (1) a Live stream compressor3121L-3125L, which adapts the compressed video stream based on Channel Feedback3161-3165, and (2) an HQ stream compressor that outputs a full-resolution HQ stream, as described above. Notably, the Live compressor is dynamic and adaptive, utilizing two-way communications with the client205, while the HQ stream is non-adaptive and one-way. Other differences between the streams are that the Live stream quality may vary dramatically, depending on the channel conditions and the nature of the video material. Some frames may be of poor quality, and there may be dropped frames. Also, the Live stream may be almost entirely P-frames or P-tiles, with I-frames or I-tiles appearing infrequently. The HQ stream will typically have a much higher data rate than the Live stream, and it will provide consistently high quality without dropping any frames. The HQ stream may be all I-frames, or may have frequent and/or regular I-frames or I-tiles. The HQ stream may also include B-frames or B-tiles.
In one embodiment, Shared video scaling and recompression3142(detailed below) selects only certain HQ video streams3121H1-3125H1to be scaled and recompressed at one or more different resolutions before being sent to Inbound Routing3141for routing as previously described. The other HQ video streams are either passed through at their full size to Inbound Routing3141for routing as previously described, or not passed through at all. In one embodiment, the decision on which HQ streams are scaled and recompressed and/or which HQ streams are passed through at all is determined based on whether there is an App/Game Server3121-3125that is requesting that particular HQ stream at the particular resolution (or a resolution close to the scaled or full resolution). Through this means, the only HQ streams that are scaled and recompressed (or potentially passed through at all) are HQ streams that are actually needed. In many applications of Hosting Service210, this results in a dramatic reduction of scaling and compression resources. Also, given that every HQ stream is at least compressed at its full resolution by compressors3121H1-3125H1, the bandwidth that needs to be routed to and within Shared video scaling and recompression3142is dramatically reduced compared to accepting uncompressed video. For example, a 3 Gbps uncompressed 1080p60 stream could be compressed to 10 Mbps and still retain very high quality. Thus, with Gigabit Ethernet connectivity, rather than being unable to carry even one uncompressed 3 Gbps video stream, it is possible to carry dozens of 10 Mbps video streams, with little apparent reduction in quality.
FIG. 31fshows details of Shared video scaling and recompression3142, along with a larger number of HQ video compressors HQ3121H1-3131H1. Internal Routing3192, per requests from the App/Game Servers3121-3125for particular video streams scaled to particular sizes, typically selects a subset of the compressed HQ streams from HQ video compressors HQ3121H1-3131H1. A stream within this selected subset is routed either through a Decompressor3161-3164if the requested stream is to be scaled, or on Non-scaled Video path3196if the requested stream is at full resolution. The streams to be scaled are decompressed to uncompressed video by Decompressors3161-3164, then each scaled to the requested size by Scalers3171-3174, then each compressed by a Compressor3181-3184. Note that if a particular HQ stream is requested at more than one resolution, then Internal Routing3192multicasts that stream (using IP multicasting technology that is well-known by practitioners in the art) to one or more Decompressors3161-3164and (if one of the requested sizes is full resolution) to Outbound Routing3193. All of the requested streams, whether scaled (from Compressors3181-3184) or not (from Internal Routing3192), are then sent to Outbound Routing3193. Outbound Routing3193then sends each requested stream to the App/Game Server3121-3125that requested it. In one embodiment, if more than one App/Game server requests the same stream at the same resolution, then Outbound Routing3193multicasts the stream to all of the App/Game Servers3121-3125that are making the request.
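The routing decision can be sketched as follows. The request tuple shape and helper names are assumptions; the behavior shown (scale only what is requested, pass full-resolution requests through, multicast identical requests) follows the description above.

```python
from collections import defaultdict

def route_requests(requests, full_resolutions):
    """requests: iterable of (server_id, stream_id, resolution) tuples.
    full_resolutions: dict mapping stream_id -> its full resolution."""
    subscribers = defaultdict(set)             # one entry per unique output
    for server_id, stream_id, resolution in requests:
        subscribers[(stream_id, resolution)].add(server_id)
    plan = []
    for (stream_id, resolution), servers in subscribers.items():
        path = ("pass_through" if resolution == full_resolutions[stream_id]
                else "decompress_scale_recompress")
        # All servers requesting the same output share a single multicast.
        plan.append({"stream": stream_id, "resolution": resolution,
                     "path": path, "multicast_to": sorted(servers)})
    return plan
```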
In the presently preferred embodiment of the Shared video scaling and recompression3142, the routing is implemented using Gigabit Ethernet switches, and the decompression, scaling and compression are implemented by discrete specialized semiconductor devices implementing each function. The same functionality could be implemented with a higher level of integration in hardware, or by very fast processors.
FIG. 31eshows another embodiment of Hosting Service210, where the function of Delay Buffer3115, previously described, is implemented in a Shared video delay buffer, scaling and decompression subsystem3143. The details of subsystem3143are shown inFIG. 31g. The operation of subsystem3143is similar to that of subsystem3142shown inFIG. 31f, except that routing logic3191first selects which HQ video streams are to be routed, per requests from App/Game Servers3121-3125. The HQ streams that are requested to be delayed are routed through Delay Buffer3194, implemented as a RAID array in the presently preferred embodiment (but which could be implemented in any storage medium of sufficient bandwidth and capacity), and streams that are not requested to be delayed are routed through Non-delayed Video path3195. The output of both the Delay Buffer3194and the Non-delayed Video path3195is then routed by Internal Routing3192based on whether requested streams are to be scaled or not scaled. Scaled streams are routed through Decompressors3161-3164, Scalers3171-3174and Compressors3181-3184to Outbound Routing3193. Non-scaled Video3196is also sent to Outbound Routing3193, which then sends the video in unicast or multicast mode to App/Game Servers in the same manner as previously described for subsystem3142ofFIG. 31f.
Another embodiment of the video delay buffer, scaling and decompression subsystem3143is shown inFIG. 31h. In this embodiment, an individual Delay Buffer HQ3121D-HQ3131D is provided for each HQ stream. Given the rapidly declining cost of RAM and Flash ROM, which can be used to delay an individual compressed video stream, this may end up being less expensive and/or more flexible than having a shared Delay Buffer3194. Or, in yet another embodiment, a single Delay Buffer3197(shown in dotted line) can provide delay for all of the HQ streams individually in a high-performance collective resource (e.g., very fast RAM, Flash or disk). In either scenario, each Delay Buffer HQ3121D-3131D is able to variably delay a stream from the HQ video source, or pass the stream through without delay. In another embodiment, each delay buffer is able to provide multiple streams with different delay amounts. All delays or non-delays are requested by App/Game Servers3121-3125. In all of these cases, Delayed and Non-Delayed Video streams3198are sent to Internal Routing3192and proceed through the rest of the subsystem3143as previously described relative toFIG. 31g.
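A per-stream delay buffer of the kind described (HQ3121D-HQ3131D) can be sketched as a simple FIFO over compressed frames; a real implementation would be backed by RAM, Flash or a RAID array rather than this illustrative in-memory structure.

```python
from collections import deque

class DelayBuffer:
    """Hold compressed frames and release each one `delay_frames` later;
    a delay of zero passes frames straight through."""
    def __init__(self, delay_frames=0):
        self.delay_frames = delay_frames
        self._fifo = deque()

    def push(self, compressed_frame):
        """Insert the newest frame; return the delayed frame if one is due."""
        if self.delay_frames == 0:
            return compressed_frame            # pass-through, no delay
        self._fifo.append(compressed_frame)
        if len(self._fifo) > self.delay_frames:
            return self._fifo.popleft()        # delayed frame now due
        return None                            # still filling the delay window
```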
In the preceding embodiments relative to the variousFIG. 31configurations, note that the Live stream utilizes a two-way connection and is tailored for a particular user, with minimal latency. The HQ streams utilize one-way connections and are both unicast and multicast. Note that while the multicast function is illustrated in these Figures as a single unit, such as could be implemented in a Gigabit Ethernet switch, in a large-scale system the multicast function would likely be implemented through a tree of multiple switches. Indeed, in the case of a video stream from a top-ranked video game player, it may well be the case that the player's HQ stream is watched by millions of users simultaneously. In such a case, there would likely be a large number of individual switches in successive stages broadcasting the multicasted HQ stream.
For both diagnostic purposes, and so as to provide feedback to the user (e.g., to let the user know how popular his gameplay performance is), in one embodiment, the hosting service210keeps track of how many simultaneous viewers there are of each App/Game Server3121-3125's video stream. This can be accomplished by keeping a running count of the number of active requests by App/Game servers for a particular video stream. Thus, a gamer who has 100,000 simultaneous viewers will know that his or her gameplay is very popular, and it will create an incentive for game players to perform better and attract viewers. When there is very large viewership of video streams (e.g., of a championship video game match), it may be desirable for commentators to speak during the video game match such that some or all users watching the multicast can hear their commentary. This could be implemented by multicasting the commentary audio, in a separate stream, to all of the Game/App Servers3121-3125requesting such commentary. The Game/App Servers would merge the audio of the commentary, using audio mixing techniques well-known by practitioners in the art, into the audio stream sent to the user premises211. There could well be multiple commentators (e.g., with different viewpoints, or in different languages), and users could select among them.
Applications and Games running on the App/Game servers will be provided with an Application Program Interface (API) through which the App and/or Game can submit requests for particular video streams with particular characteristics (e.g., resolution and amount of delay). These API requests, submitted to an operating environment running on the App/Game Server or to a Hosting Service Control System401ofFIG. 4a, may be rejected for a variety of reasons. For example, the video stream requested may have certain licensing rights restrictions (e.g., such that it can only be viewed by a single viewer, not broadcast to others), there may be subscription restrictions (e.g., the viewer may have to pay for the right to view the stream), there may be age restrictions (e.g., the viewer may have to be 18 to view the stream), there may be privacy restrictions (e.g., the person using the App or playing the game may limit viewing to just a selected number or class of viewers (e.g., his or her "friends"), or may not allow viewing at all), and there may be restrictions requiring that the material be delayed (e.g., if the user is playing a stealth game where his or her position might be revealed). Any number of other restrictions might limit viewing of the stream. In any of these cases, the request by the App/Game server would be rejected with a reason for the rejection and, in one embodiment, with alternatives under which the request would be accepted (e.g., stating what fee must be paid for a subscription).
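The following sketch illustrates such an API returning a rejection reason and, where applicable, alternatives. All field names and the specific checks are illustrative assumptions drawn from the examples above; the text specifies only that a rejection carries a reason and, optionally, terms under which the request would be accepted.

```python
def request_video_stream(stream, viewer, resolution, delay_s):
    """Grant or reject a stream request; a rejection names its reason and
    may offer alternatives under which the request would be accepted."""
    if stream.get("license_single_viewer") and stream.get("viewer_count", 0) >= 1:
        return {"granted": False, "reason": "licensing: single viewer only"}
    if stream.get("subscription_fee", 0) > 0 and not viewer.get("subscribed"):
        return {"granted": False, "reason": "subscription required",
                "alternatives": ["subscribe for $%s" % stream["subscription_fee"]]}
    if viewer.get("age", 0) < stream.get("min_age", 0):
        return {"granted": False, "reason": "age restriction"}
    if stream.get("friends_only") and viewer.get("id") not in stream.get("friends", ()):
        return {"granted": False, "reason": "privacy: friends only"}
    if delay_s < stream.get("min_delay_s", 0):   # e.g., stealth-game position hiding
        return {"granted": False, "reason": "material must be delayed",
                "alternatives": ["request delay >= %ss" % stream["min_delay_s"]]}
    return {"granted": True, "resolution": resolution, "delay_s": delay_s}
```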
HQ video streams that are stored in Delay Buffers in any of the preceding embodiments may be exported to other destinations outside of the Hosting Service210. For example, a particularly interesting video stream can be requested by an App/Game server (typically at the request of a user) to be exported to YouTube. In such a case, the video stream would be transmitted through the Internet in a format agreed upon with YouTube, together with appropriate descriptive information (e.g., the name of the user playing, the game, the time, the score, etc.).
In a similar manner, separate audio streams could be mixed in, or serve as a replacement for, the audio track of particular video streams (or individual streams) in the Hosting Service210, either mixing or replacing audio from video streaming in real-time or from a Delay Buffer. Such audio could be commentary or narration, or it could provide voices for characters in the video stream. This would enable Machinima (user-generated animations from video game video streams) to be readily created by users.
The video streams described throughout this document are shown as captured from the video output of App/Game servers, and then streamed and/or delayed and reused or distributed in a variety of ways. The same Delay Buffers can be used to hold video material that comes from non-App/Game server sources and provide the same degree of flexibility for playback and distribution, with appropriate restrictions. Such sources include live feeds from television stations (either over-the-air or non-over-the-air, such as CNN, and either for-pay, such as HBO, or free). Such sources also include pre-recorded movies or television shows, home movies, advertisements, and live video teleconference feeds. Live feeds would be handled like the live output of a Game/App Server. Pre-recorded material would be handled like the output of a Delay Buffer.
Video Window Zooming and Translation
FIGS. 16, 17 and 18show a user interface through the stages of zooming a thumbnail video window1600to medium size1700, and finally to full-size video window1800. The zoomed video windows1600,1700and1800are generated by app/game server1521-1525ofFIG. 15, as described previously. If the user is viewing video windows1600,1700and1800on a typical desktop monitor or TV screen, once a window has been zoomed to full size, the image is often large enough for the user to see the content of the window in sufficient detail for the user's application. For example, the content shown in video window1800is a video game, and if the display device is a 32″ TV set, or a 23″ computer monitor, then if the user is within typical viewing distance for the display device, the user will be able to make out sufficient details in the video so as to be able to play the game. As an example, the user will be able to see there is a car1802up ahead (and perhaps avoid it, run into it, follow it, etc., depending on the game) and the user will be able to discern details of map1803and speedometer1804with sufficient clarity so as to be able to play the game.
If the user's display device is very small, however, then it may be the case that the video window1800is so small that the user is unable to discern sufficient details in the content for the user's application (or due to limitations of vision). For example, if the display device is a small cellular phone, such as an Apple® iPhone, or a small media player, such as an Apple iPod Touch, the very small screen size and/or low screen resolution may make it difficult or impossible to make out car1802, or read details of map1803or speedometer1804. If so, it may be difficult or impossible to play the game.
In one embodiment, server1521-1525is responsive to user input to zoom video window1800to be larger than the full size of the screen, and/or the server1521-1525is responsive to user input to translate the zoomed-up screen horizontally and/or vertically and/or diagonally. This allows the user to zoom in on particular areas of the screen so as to view them in more detail. For example, if the user zooms in to the center of the screen, car1802can become big enough to be discernable. If the user zooms into the screen and translates the image so that the lower left is visible, then map1803can become discernable. If the user zooms into the lower right of the screen, then speedometer1804can become discernable. Depending on the nature of the game or application, the user may want to pause the video either before, during or after zooming/translating by performing a user interface action that indicates to server1521-1525that the user desires to pause. For example, in a driving game, the user may not pause while zoomed into the center of the screen and focused on the road, but may pause before viewing the map1803or speedometer1804, so as not to crash the car while the road ahead is no longer visible on the screen. In the case of a productivity application or a web browser application, it may not be necessary to pause the video for zooming or translation.
There are various prior art user interface techniques that are used for scaling and/or translating prior art windows, such as "pinching", "spreading" and/or "swiping" a touch screen or track pad with the fingers, e.g., on an Apple iPhone, iPod Touch or Macintosh. As another example, the scroll wheel on a mouse can be used to specify how much to zoom, while clicking the mouse and moving it can be used to specify translation. These techniques and others can be applied to embodiments of video window1800of the present invention so that the user can specify zoom and/or translation. The specified scaling and/or translation user input is sent as Control Signals406inFIG. 4ato server1521-1525, and the requested scaling and/or translation action is performed. In one embodiment, to the extent the scaled and/or translated video window would extend beyond the edge of the display device, it is cropped by server1521-1525to the edges of the display device (or the edges of a window containing the zoomed portion of the image), so that only the visible pixels are compressed by Shared Hardware Compression1530and sent to Home or Office client device415inFIG. 4a, so as to reduce transmission bandwidth. In another embodiment, some or all of the zoomed and/or translated video window that would extend beyond the edges of the display device (or the edges of a window containing the zoomed portion of the image) is generated by server1521-1525, is compressed by Shared Hardware Compression1530and sent to Home or Office client device415, and is then cropped to the edges of the display device (or the edges of a window containing the zoomed portion of the image). Although the embodiment of the preceding sentence results in higher transmission bandwidth than minimally necessary, the scaling and/or translating operation is carried out locally by Home or Office client device415, which may reduce the latency of the response time of the zooming and/or translation operation, particularly if the connection between Home or Office client device415and Hosting Service210is incurring noticeable latency.
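The cropping computation in the first of these embodiments can be sketched as follows; the coordinate conventions and clamping behavior are assumptions made for the sketch.

```python
def visible_region(src_w, src_h, disp_w, disp_h, zoom, pan_x, pan_y):
    """Return (x, y, w, h) of the source-image rectangle actually shown on
    the display for a given zoom factor and pan offset, so that only those
    pixels need to be compressed and transmitted."""
    view_w, view_h = disp_w / zoom, disp_h / zoom   # zooming in shows less source
    x = min(max(pan_x, 0), src_w - view_w)          # clamp pan to image bounds
    y = min(max(pan_y, 0), src_h - view_h)
    return (x, y, view_w, view_h)

# Zooming 2x into the lower right of a 1280x720 frame on a 1280x720 display:
print(visible_region(1280, 720, 1280, 720, zoom=2.0, pan_x=640, pan_y=360))
# -> (640.0, 360.0, 640.0, 360.0): only a quarter of the pixels must be sent
```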
In one embodiment, the scaled and/or translated video window is a Web browser window. Prior art devices with small screens, for example, cellular phones or media players such as the Apple iPhone or iPod Touch, have Web browsers which have the facility to zoom and/or translate the window of the displayed web site (e.g., by "pinching", "spreading" and/or "swiping") so as to view the content more closely to discern more detail, or to view it with low detail in its entirety.
In a practical context, there are significant disadvantages to using such prior art Web browsers, often resulting in slow Web page loading or inefficient usage of network resources. If the user is zoomed into a Web page, typically the browser still needs to download most, if not all, of the Web page content, because it is typically necessary for the Web browser to process and display all or most of the elements of the Web page before it can determine what the resulting image will be in the zoomed-in portion of the Web page. As one simple example, if the Web browser is zoomed into a portion of a large jpeg image, the jpeg compression algorithm typically will require the entire image to be decompressed before a portion of the image can be displayed. As another example, HTML pages often implicitly or explicitly position certain elements of a Web page relative to the position of other elements, requiring elements that may be off the edge of a zoomed-in page to be downloaded and parsed before the position of the elements of the zoomed-in portion is known.
Further, as described above and illustrated inFIG. 24, the load time of websites asymptotically approaches the sum of the load time incurred in loading each file that makes up the website to the extent the loading of the files cannot be overlapped. And, also, many elements, such as jpeg images for rollovers, have to be loaded, but may never be displayed.
FIG. 24illustrates the load time at various connection speeds of a 54-file Web site where each file has a latency overhead of 100 ms, which is a common latency for HTTP files using a wireline Internet connection. If a cellular network is used, for example, the AT&T 3G cellular network, the HTTP file latency overhead is typically much higher—perhaps as high as 400 ms or more. So, while the example website ofFIG. 24asymptotically approaches 5.4 seconds of load time, with 400 ms HTTP file latency, the same website would asymptotically approach 21.6 seconds of load time, and given that 3G cellular networks may well operate at lower connection speeds than wireline connections, the total load time may be 30 seconds or longer. And, only then will the Web browser be able to display a Web page, even if only a zoomed-in portion is being viewed.
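The asymptotic figures above follow directly from multiplying the file count by the per-file latency, as the following lines reproduce:

```python
files = 54
print(files * 0.100)   # 5.4 seconds at 100 ms per HTTP file (wireline)
print(files * 0.400)   # 21.6 seconds at 400 ms per HTTP file (3G cellular)
```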
This is not only an inconvenience to the user, but it is a very inefficient use of wireless data throughput, which often is shared among many users, and often less (frequently far less) wireless data throughput is available than there is demand for.
As previously described, in one embodiment a Web browser is hosted in server1521-1525, sourcing Web page files stored locally in Hosting Service210, or sourcing files by utilizing a connection to the Internet available in Hosting Service210. The Web browser window is then transmitted to Home or Office Client device415for display. The Web browser is controlled by the user's actions via Control Signals406. If the Home or Office Client device415is a cellular phone operating through a cellular network, the latency will typically be much higher between the cell phone and the Internet than between the Hosting Service210and the Internet. So, whether the Web page files are stored locally in the Hosting Service210or located on remote Web servers, it is highly likely that the user will experience much lower delay in viewing a Web page generated by a Web Browser hosted in server1521-1525and transmitted to the cellular phone as video, than the delay the user would experience if the same Web page were generated by a Web Browser running locally on the cellular phone, with Web page files transmitted conventionally through the cellular network. Further, as previously described, hosting the Web Browser in server1521-1525would likely result in much less data being transmitted for a given Web page to the Home or Office Client device415. Also, if the cellular network connection speed degrades, various techniques previously described can be used to dynamically reduce the transmitted data throughput so as to not exceed the capacity of the cellular channel. Additionally, this approach eliminates the need for keeping a local Web browser up-to-date (e.g. with the latest version of Adobe Flash), since the Web browser running in server1521-1525can be kept up-to-date. Indeed, the Web browser running in server1521-1525may be able to perform operations that are beyond the computational capabilities of Home or Office Client device415.
In one embodiment, the video window generated on the Home or Office Client device415can be zoomed and/or translated as previously described. In this way, the user can either zoom and/or translate to view detail in the Web page, or, if zooming out, get an overall view of the Web page. Indeed, it may well be the case that the Web page image is zoomed down in size to only take up a portion of the screen, such as the windows1600and1700, and indeed, may be viewed simultaneously with other windows, whether they are other Web pages, or video windows of video games, videos, applications, etc.
Some prior art Web browsers, such as Apple Safari, can display windows of multiple reduced-size websites at once, for example, to show which websites the user has visited most frequently. Because the multiple websites are displayed simultaneously, it is typically impractical to display all of them dynamically (e.g., showing what is presently displayed live on the websites) because, among other issues, the sum of the bandwidth demands from all of the websites may be excessive. For example, if 8 websites are showing video, the full video bandwidth from all websites must be received, decompressed and then scaled down to the reduced window sizes. This not only results in a non-real-time experience for the user, but it is wasteful of bandwidth since the entire websites need to be downloaded and then scaled down, with much of the detail lost.
In one embodiment, one or more Web browsers are hosted on server1521-1525and the video is scaled before being sent to Home or Office Client device415, so as to appear as multiple scaled-down video windows, such as those shown inFIG. 16. In contrast to prior art Web browsers that can display multiple Web sites at once as reduced-size windows, in this embodiment the Web sites can all be displayed live at once, even if some or all of them incorporate high-bandwidth elements, like video. And, these multiple live website windows can be simultaneously displayed with other live video windows, such as video games, videos, applications, etc.
In one embodiment, the number of windows available for viewing by the user is larger than the number of video windows displayed within the bounds of the display device. For example, inFIG. 16, a 6×3 array of 18 video windows is visible, but there may well be far more video windows available for viewing by the user (e.g. a 20×20 array), of which only the 6×3 subset is visible within the bounds of the display device at a given time. In one embodiment, the user is able to effectuate a translation (e.g. horizontally, vertically and/or diagonally) of the array of video windows by “swiping” a finger across a touch screen or a track pad, thus creating the illusion that the finger touch is causing the array of video windows to translate its position and reveal other video windows. This can be implemented by the finger swipe control information (e.g. position of the swipe, velocity of the swipe, etc.) being sent as Control Signals406, and the app/game servers1521-1525implementing an animation effect showing the motion of the array of video windows. In one embodiment, the animation effect is a translation showing the video wall moving horizontally, vertically or at an angle in response to the finger swipe (or other user input, such as controller action or keyboard/mouse presses, body motion, etc.), but with the video windows all remaining the same size as they move. In another embodiment, the animation effect is a non-rectilinear motion, where the video wall moves with a complex motion in response to a finger swipe (or other user input actions), and some or all of the video windows change size during the animation effect. One such complex motion is a perspective 3D motion, creating the illusion that the video wall is located in 3D space.
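A sketch of the rectilinear-translation case follows: given a scroll offset derived from the swipe control information, the servers determine which windows of the larger array fall inside the visible viewport. The swipe-to-offset mapping and array dimensions are illustrative assumptions.

```python
def visible_windows(offset_col, offset_row, view_cols=6, view_rows=3,
                    total_cols=20, total_rows=20):
    """Return the (row, col) indices of the windows visible in a
    view_cols x view_rows viewport over a total_cols x total_rows array,
    after translating by the (possibly fractional) scroll offset."""
    col0 = min(max(int(offset_col), 0), total_cols - view_cols)
    row0 = min(max(int(offset_row), 0), total_rows - view_rows)
    return [(r, c) for r in range(row0, row0 + view_rows)
                   for c in range(col0, col0 + view_cols)]

print(len(visible_windows(7.3, 2.8)))   # 18: the 6x3 subset visible at a time
```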
In one embodiment, the various functional modules illustrated herein and the associated steps may be performed by specific hardware components that contain hardwired logic for performing the steps, such as an application-specific integrated circuit (“ASIC”) or by any combination of programmed computer components and custom hardware components.
In one embodiment, the modules may be implemented on a programmable digital signal processor ("DSP") such as Texas Instruments' TMS320x architecture (e.g., a TMS320C6000, TMS320C5000, . . . , etc.). Various different DSPs may be used while still complying with these underlying principles.
Embodiments may include various steps as set forth above. The steps may be embodied in machine-executable instructions which cause a general-purpose or special-purpose processor to perform certain steps. Various elements that are not relevant to these underlying principles, such as computer memory, hard drives, input devices, etc., have been left out of some or all of the figures to avoid obscuring the pertinent aspects.
Elements of the disclosed subject matter may also be provided as a machine-readable medium for storing the machine-executable instructions. The machine-readable medium may include, but is not limited to, flash memory, optical disks, CD-ROMs, DVD-ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, or any other type of machine-readable medium suitable for storing electronic instructions.
It should also be understood that elements of the disclosed subject matter may also be provided as a computer program product which may include a machine-readable medium having stored thereon instructions which may be used to program a computer (e.g., a processor or other electronic device) to perform a sequence of operations. Alternatively, the operations may be performed by a combination of hardware and software. The machine-readable medium may include, but is not limited to, floppy diskettes, optical disks, CD-ROMs, magneto-optical disks, ROMs, RAMs, EPROMs, EEPROMs, magnetic or optical cards, or any other type of machine-readable medium suitable for storing electronic instructions.
Additionally, although the disclosed subject matter has been described in conjunction with specific embodiments, numerous modifications and alterations are well within the scope of the present disclosure. Accordingly, the specification and drawings are to be regarded in an illustrative rather than restrictive sense.
Claims
- A method for streaming a video game from a server to a client, comprising:
generating, at the server, video frames for the video game responsive to input received from the client;
encoding, at the server, the video frames to generate compressed video frames, and storing past encoder states;
sending, by the server, the compressed video frames, wherein the compressed video frames are configured to be decoded by the client and the client is configured to store past decoder states;
receiving, by the server from the client, a feedback signal used by the server to identify that one or more of the compressed video frames have not been received; and
encoding, by the server, one or more next video frames to generate compressed video frames, wherein the encoding is updated based on adjusted frame dependencies to use compressed video frames that are known to have been successfully received based on the feedback signal;
wherein said encoding of the one or more next video frames does not require reference to a new I frame, and wherein when the server determines that the updated encoding using a compressed video frame known to have been successfully received is taking longer to generate than the new I frame, the encoding is stopped and the new I frame is generated instead.
- The method of claim 1, wherein the compressed video frames that are dependent on compressed video frames known to have been successfully received based on the feedback signal are one or more P frames, and said P frames are generated instead of generating said new I frame.
- The method of claim 1, wherein the past decoder states are stored in memory associated with the decoder.
- The method of claim 1, wherein the past encoder states are stored in memory associated with the encoder.
- The method of claim 1, wherein the past decoder states are stored in memory associated with the decoder and the past encoder states are stored in memory associated with the encoder.
- The method of claim 1, wherein the past decoder states and the past encoder states include data associated with one or more P frames.
- The method of claim 1, wherein storing the past encoder and decoder states enables utilization of data associated with P frames that are known to have been successfully received, to avoid causing generation of said new I frame responsive to one or more lost frames that are not received by the client.
- The method of claim 1, wherein said feedback signal is sent by the decoder directly to the encoder.
- The method of claim 1, wherein said server is part of a hosting service.
- The method of claim 1, wherein said feedback signal reports non-receipt of a frame or indicates a delay in receiving a frame that will be dropped upon arriving late.
- A system for streaming a video game from a server to a client, comprising:
the server, configured to generate video frames for the video game responsive to input received from the client; and
an encoder that processes the video frames to generate compressed video frames and stores past encoder states in memory associated with the encoder;
wherein the server transmits the compressed video frames to the client;
wherein the server is configured to receive a feedback signal from the client to determine when one or more of the compressed video frames that were sent were not received by the client;
wherein the encoder is configured to adjust frame dependencies before generating one or more next video frames as compressed video frames that are dependent on compressed video frames that are known to have been successfully received based on the feedback signal received from said client, and said adjusted frame dependencies do not require generating a new I frame; and
wherein when the server determines that the updated encoding using a compressed video frame known to have been successfully received is taking longer to generate than the new I frame, the encoding is stopped and the new I frame is generated instead.
- The system of claim 11, wherein the compressed video frames that are dependent on compressed video frames known to have been successfully received based on the feedback signal are one or more P frames, and said P frames are generated instead of generating said new I frame.
- The system of claim 11, wherein the past encoder states are stored in memory associated with the encoder.
- The system of claim 11, wherein storing the past encoder states enables utilization of data associated with P frames that are known to have been successfully received, to avoid causing generation of said new I frame responsive to one or more lost frames that are not received by the client.
- The system of claim 11, wherein said feedback signal is received by the encoder directly from a decoder of the client.
- The system of claim 11, wherein said server is part of a hosting service.
- The system of claim 11, wherein said feedback signal reports non-receipt of a frame or indicates a delay in receiving a frame that will be dropped upon arriving late.
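For illustration only, and not as part of the claims, the following sketch models the feedback-driven reference selection recited in claims 1 and 11: the encoder tracks which frames the client has acknowledged, prefers a P frame referencing a known-received frame over a new I frame, and falls back to the I frame when the recovery encoding is expected to take longer. The encoder interface, cost estimates, and return format are hypothetical; real codecs expose this very differently.

```python
# Hypothetical sketch of the feedback-driven reference selection of
# claims 1 and 11; not the patent's implementation.
class FeedbackAwareEncoder:
    def __init__(self):
        self.acked = set()       # frame ids the client confirmed receiving
        self.last_sent_id = -1

    def on_feedback(self, frame_id: int, received: bool) -> None:
        """Client reports receipt; loss or a late-dropped frame is not acked."""
        if received:
            self.acked.add(frame_id)

    def encode_next(self, raw_frame, est_p_cost: float, est_i_cost: float):
        """Return (frame_type, frame_id, reference_id) for the next frame.

        raw_frame would be compressed against the reference here; the
        actual compression step is omitted in this sketch.
        """
        self.last_sent_id += 1
        ref = max(self.acked, default=None)
        # Prefer a P frame that depends on a frame known to be received,
        # avoiding a new I frame (claims 2 and 12).
        if ref is not None and est_p_cost <= est_i_cost:
            return ("P", self.last_sent_id, ref)
        # Fallback of claims 1 and 11: if the recovery encoding would take
        # longer to generate than an I frame, generate the I frame instead.
        return ("I", self.last_sent_id, None)

enc = FeedbackAwareEncoder()
print(enc.encode_next(None, est_p_cost=1.0, est_i_cost=3.0))  # ('I', 0, None)
enc.on_feedback(0, received=True)
print(enc.encode_next(None, est_p_cost=1.0, est_i_cost=3.0))  # ('P', 1, 0)
print(enc.encode_next(None, est_p_cost=5.0, est_i_cost=3.0))  # ('I', 2, None)
```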