U.S. Pat. No. 7,674,181

GAME PROCESSING

Assignee: Sony Interactive Entertainment Europe Ltd

Issue Date: August 31, 2005

Illustrative Figure

Abstract

Game processing apparatus comprising: an indication generator operable to generate an indication of a sequence of target actions to be executed by a user; a detector operable to detect user actions executed by said user; comparison logic operable to compare said target actions with said user actions to determine whether said target actions have been successfully completed by said user actions; and scoring logic operable, when in a non-scoring mode, to detect a first pattern of target actions that have been successfully completed with respect to target actions that have not been successfully completed, as determined by said comparison logic, said scoring logic entering a scoring mode upon detection of said first pattern, said scoring logic operable, when in said scoring mode, to detect a second pattern of target actions that have been successfully completed with respect to target actions that have not been successfully completed, as determined by said comparison logic, said scoring logic entering said non-scoring mode upon detection of said second pattern, said scoring logic operable to generate a score for said user when said scoring logic is in said scoring mode, said score being dependent upon said determination by said comparison logic.

Description


DESCRIPTION OF THE PREFERRED EMBODIMENTS

FIG. 1 schematically illustrates the overall system architecture of a PlayStation2 machine. A system unit 10 is provided, with various peripheral devices connectable to the system unit. It will be appreciated that the PlayStation2 machine is just an example of such a system unit 10 and that other embodiments may make use of other system units.

The system unit 10 comprises: an Emotion Engine 100; a Graphics Synthesiser 200; a sound processor unit 300 having dynamic random access memory (DRAM); a read only memory (ROM) 400; a compact disc (CD) and digital versatile disc (DVD) reader 450; a Rambus Dynamic Random Access Memory (RDRAM) unit 500; an input/output processor (IOP) 700 with dedicated RAM 750. An (optional) external hard disk drive (HDD) 390 may be connected.

The input/output processor 700 has two Universal Serial Bus (USB) ports 715 and an iLink or IEEE 1394 port (iLink is the Sony Corporation implementation of the IEEE 1394 standard). The IOP 700 handles all USB, iLink and game controller data traffic. For example, when a user is playing a game, the IOP 700 receives data from the game controller and directs it to the Emotion Engine 100, which updates the current state of the game accordingly. The IOP 700 has a Direct Memory Access (DMA) architecture to facilitate rapid data transfer rates. DMA involves transfer of data from main memory to a device without passing it through the CPU. The USB interface is compatible with Open Host Controller Interface (OHCI) and can handle data transfer rates of between 1.5 Mbps and 12 Mbps. Provision of these interfaces means that the PlayStation2 machine is potentially compatible with peripheral devices such as video cassette recorders (VCRs), digital cameras, microphones, set-top boxes, printers, keyboards, mice and joysticks.

Generally, in order for successful data communication to occur with a peripheral device connected to a USB port 715, an appropriate piece of software such as a device driver should be provided. Device driver technology is very well known and will not be described in detail here, except to say that the skilled man will be aware that a device driver or similar software interface may be required in the embodiment described here.

In the present embodiment, a USB microphone 730 is connected to the USB port. It will be appreciated that the USB microphone 730 may be a hand-held microphone or may form part of a head-set that is worn by the human operator. The advantage of wearing a head-set is that the human operator's hands are free to perform other actions. The microphone includes an analogue-to-digital converter (ADC) and a basic hardware-based real-time data compression and encoding arrangement, so that audio data are transmitted by the microphone 730 to the USB port 715 in an appropriate format, such as 16-bit mono PCM (an uncompressed format), for decoding at the PlayStation2 system unit 10.

Apart from the USB ports, two other ports 705, 710 are proprietary sockets allowing the connection of a proprietary non-volatile RAM memory card 720 for storing game-related information, a hand-held game controller 725 or a device (not shown) mimicking a hand-held controller, such as a dance mat.

The system unit 10 may be connected to a network adapter 805 that provides an interface (such as an Ethernet interface) to a network. This network may be, for example, a LAN, a WAN or the Internet. The network may be a general network or one that is dedicated to game-related communication. The network adapter 805 allows data to be transmitted to and received from other system units 10 that are connected to the same network (the other system units 10 also having corresponding network adapters 805).

The Emotion Engine 100 is a 128-bit Central Processing Unit (CPU) that has been specifically designed for efficient simulation of 3-dimensional (3D) graphics for games applications. The Emotion Engine components include a data bus, cache memory and registers, all of which are 128-bit. This facilitates fast processing of large volumes of multi-media data. Conventional PCs, by way of comparison, have a basic 64-bit data structure. The floating point calculation performance of the PlayStation2 machine is 6.2 GFLOPs. The Emotion Engine also comprises MPEG2 decoder circuitry which allows for simultaneous processing of 3D graphics data and DVD data. The Emotion Engine performs geometrical calculations including mathematical transforms and translations, and also performs calculations associated with the physics of simulation objects, for example, calculation of friction between two objects. It produces sequences of image rendering commands which are subsequently utilised by the Graphics Synthesiser 200. The image rendering commands are output in the form of display lists. A display list is a sequence of drawing commands that specifies to the Graphics Synthesiser which primitive graphic objects (e.g. points, lines, triangles, sprites) to draw on the screen and at which co-ordinates. Thus a typical display list will comprise commands to draw vertices, commands to shade the faces of polygons, render bitmaps and so on. The Emotion Engine 100 can asynchronously generate multiple display lists.

The Graphics Synthesiser 200 is a video accelerator that performs rendering of the display lists produced by the Emotion Engine 100. The Graphics Synthesiser 200 includes a graphics interface unit (GIF) which handles, tracks and manages the multiple display lists. The rendering function of the Graphics Synthesiser 200 can generate image data that supports several alternative standard output image formats, i.e. NTSC/PAL, High Definition Digital TV and VESA. In general, the rendering capability of graphics systems is defined by the memory bandwidth between a pixel engine and a video memory, each of which is located within the graphics processor. Conventional graphics systems use external Video Random Access Memory (VRAM) connected to the pixel logic via an off-chip bus, which tends to restrict available bandwidth. However, the Graphics Synthesiser 200 of the PlayStation2 machine provides the pixel logic and the video memory on a single high-performance chip, which allows for a comparatively large 38.4 Gigabyte per second memory access bandwidth. The Graphics Synthesiser is theoretically capable of achieving a peak drawing capacity of 75 million polygons per second. Even with a full range of effects such as textures, lighting and transparency, a sustained rate of 20 million polygons per second can be drawn continuously. Accordingly, the Graphics Synthesiser 200 is capable of rendering a film-quality image.

The Sound Processor Unit (SPU) 300 is effectively the soundcard of the system, which is capable of recognising 3D digital sound such as Digital Theater Surround (DTS®) sound and AC-3 (also known as Dolby Digital), which is the sound format used for DVDs.

A display and sound output device 305, such as a video monitor or television set with an associated loudspeaker arrangement 310, is connected to receive video and audio signals from the Graphics Synthesiser 200 and the sound processor unit 300.

The main memory supporting the Emotion Engine 100 is the RDRAM (Rambus Dynamic Random Access Memory) module 500 produced by Rambus Incorporated. This RDRAM memory subsystem comprises RAM, a RAM controller and a bus connecting the RAM to the Emotion Engine 100.

FIG. 2 schematically illustrates the architecture of the Emotion Engine 100 of FIG. 1. The Emotion Engine 100 comprises: a floating point unit (FPU) 104; a central processing unit (CPU) core 102; vector unit zero (VU0) 106; vector unit one (VU1) 108; a graphics interface unit (GIF) 110; an interrupt controller (INTC) 112; a timer unit 114; a direct memory access controller 116; an image data processor unit (IPU) 118; a dynamic random access memory controller (DRAMC) 120; and a sub-bus interface (SIF) 122. All of these components are connected via a 128-bit main bus 124.

The CPU core 102 is a 128-bit processor clocked at 300 MHz. The CPU core has access to 32 MB of main memory via the DRAMC 120. The CPU core 102 instruction set is based on MIPS III RISC, with some MIPS IV RISC instructions together with additional multimedia instructions. MIPS III and IV are Reduced Instruction Set Computer (RISC) instruction set architectures proprietary to MIPS Technologies, Inc. Standard instructions are 64-bit and two-way superscalar, which means that two instructions can be executed simultaneously. Multimedia instructions, on the other hand, use 128-bit instructions via two pipelines. The CPU core 102 comprises a 16 KB instruction cache, an 8 KB data cache and a 16 KB scratchpad RAM, which is a portion of cache reserved for direct private usage by the CPU.

The FPU 104 serves as a first co-processor for the CPU core 102. The vector unit 106 acts as a second co-processor. The FPU 104 comprises a floating point product sum arithmetic logic unit (FMAC) and a floating point division calculator (FDIV). Both the FMAC and FDIV operate on 32-bit values, so when an operation is carried out on a 128-bit value (composed of four 32-bit values), an operation can be carried out on all four parts concurrently. For example, adding two vectors together can be done at the same time.

The vector units 106 and 108 perform mathematical operations and are essentially specialised FPUs that are extremely fast at evaluating the multiplication and addition of vector equations. They use Floating-Point Multiply-Adder Calculators (FMACs) for addition and multiplication operations and Floating-Point Dividers (FDIVs) for division and square root operations. They have built-in memory for storing micro-programs and interface with the rest of the system via Vector Interface Units (VIFs). Vector unit zero 106 can work as a coprocessor to the CPU core 102 via a dedicated 128-bit bus, so it is essentially a second specialised FPU. Vector unit one 108, on the other hand, has a dedicated bus to the Graphics Synthesiser 200 and thus can be considered as a completely separate processor. The inclusion of two vector units allows the software developer to split up the work between different parts of the CPU, and the vector units can be used in either serial or parallel connection.

Vector unit zero 106 comprises 4 FMACs and 1 FDIV. It is connected to the CPU core 102 via a coprocessor connection. It has 4 Kb of vector unit memory for data and 4 Kb of micro-memory for instructions. Vector unit zero 106 is useful for performing physics calculations associated with the images for display. It primarily executes non-patterned geometric processing together with the CPU core 102.

Vector unit one 108 comprises 5 FMACs and 2 FDIVs. It has no direct path to the CPU core 102, although it does have a direct path to the GIF unit 110. It has 16 Kb of vector unit memory for data and 16 Kb of micro-memory for instructions. Vector unit one 108 is useful for performing transformations. It primarily executes patterned geometric processing and directly outputs a generated display list to the GIF 110.

The GIF 110 is an interface unit to the Graphics Synthesiser 200. It converts data according to a tag specification at the beginning of a display list packet and transfers drawing commands to the Graphics Synthesiser 200 whilst mutually arbitrating multiple transfers. The interrupt controller (INTC) 112 serves to arbitrate interrupts from peripheral devices, except the DMAC 116.

The timer unit 114 comprises four independent timers with 16-bit counters. The timers are driven either by the bus clock (at 1/16 or 1/256 intervals) or via an external clock. The DMAC 116 handles data transfers between main memory and peripheral processors, or between main memory and the scratchpad memory. It arbitrates the main bus 124 at the same time. Performance optimisation of the DMAC 116 is a key way by which to improve Emotion Engine performance. The image processing unit (IPU) 118 is an image data processor that is used to expand compressed animations and texture images. It performs I-PICTURE macro-block decoding, colour space conversion and vector quantisation. Finally, the sub-bus interface (SIF) 122 is an interface unit to the IOP 700. It has its own memory and bus to control I/O devices such as sound chips and storage devices.

FIG. 3 schematically illustrates the configuration of the Graphics Synthesiser 200. The Graphics Synthesiser comprises: a host interface 202; a set-up/rasterizing unit; a pixel pipeline 206; a memory interface 208; a local memory 212 including a frame page buffer 214 and a texture page buffer 216; and a video converter 210.

The host interface 202 transfers data with the host (in this case the CPU core 102 of the Emotion Engine 100). Both drawing data and buffer data from the host pass through this interface. The output from the host interface 202 is supplied to the Graphics Synthesiser 200, which develops the graphics to draw pixels based on vertex information received from the Emotion Engine 100, and calculates information such as RGBA value, depth value (i.e. Z-value), texture value and fog value for each pixel. The RGBA value specifies the red, green, blue (RGB) colour components, and the A (Alpha) component represents the opacity of an image object. The Alpha value can range from completely transparent to totally opaque. The pixel data is supplied to the pixel pipeline 206, which performs processes such as texture mapping, fogging and Alpha-blending, and determines the final drawing colour based on the calculated pixel information.

The pixel pipeline 206 comprises 16 pixel engines PE1, PE2, . . . , PE16, so that it can process a maximum of 16 pixels concurrently. The pixel pipeline 206 runs at 150 MHz with 32-bit colour and a 32-bit Z-buffer. The memory interface 208 reads data from and writes data to the local Graphics Synthesiser memory 212. It writes the drawing pixel values (RGBA and Z) to memory at the end of a pixel operation and reads the pixel values of the frame buffer 214 from memory. These pixel values read from the frame buffer 214 are used for pixel testing or Alpha-blending. The memory interface 208 also reads from local memory 212 the RGBA values for the current contents of the frame buffer. The local memory 212 is a 32 Mbit (4 MB) memory that is built in to the Graphics Synthesiser 200. It can be organised as a frame buffer 214, a texture buffer 216 and a 32-bit Z-buffer 215. The frame buffer 214 is the portion of video memory where pixel data such as colour information is stored.

The Graphics Synthesiser uses a 2D-to-3D texture mapping process to add visual detail to 3D geometry. Each texture may be wrapped around a 3D image object and is stretched and skewed to give a 3D graphical effect. The texture buffer is used to store the texture information for image objects. The Z-buffer 215 (also known as a depth buffer) is the memory available to store the depth information for a pixel. Images are constructed from basic building blocks known as graphics primitives or polygons. When a polygon is rendered with Z-buffering, the depth value of each of its pixels is compared with the corresponding value stored in the Z-buffer. If the value stored in the Z-buffer is greater than or equal to the depth of the new pixel value, then this pixel is determined to be visible, so it should be rendered and the Z-buffer will be updated with the new pixel depth. If, however, the Z-buffer depth value is less than the new pixel depth value, the new pixel is behind what has already been drawn and will not be rendered.
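The Z-buffering comparison described above can be sketched in a few lines. This is a minimal illustrative model, not the Graphics Synthesiser's actual pixel logic; the function name and the list-of-lists buffers are assumptions for illustration, and the depth convention follows the text (a stored value greater than or equal to the new pixel's depth means the new pixel is visible).

```python
# Hypothetical sketch of the Z-buffer visibility test described above.
# Buffers are plain lists of rows; depth convention follows the text:
# stored depth >= new depth means the new pixel is visible.

def z_buffer_draw(z_buffer, frame_buffer, x, y, depth, colour):
    """Draw one pixel with Z-buffering; return True if it was rendered."""
    if z_buffer[y][x] >= depth:      # new pixel is at least as near: visible
        frame_buffer[y][x] = colour  # write the colour to the frame buffer
        z_buffer[y][x] = depth       # record the new, nearer depth
        return True
    return False                     # behind existing geometry: discarded
```

A pixel drawn at depth 5 into a buffer initialised to depth 10 is rendered; a later pixel at the same position with depth 7 is rejected, since 5 < 7.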

The local memory 212 has a 1024-bit read port and a 1024-bit write port for accessing the frame buffer and Z-buffer, and a 512-bit port for texture reading. The video converter 210 is operable to display the contents of the frame memory in a specified output format.

The present embodiment will be described with reference to a karaoke game (although it will be appreciated that embodiments of the invention are not limited to karaoke games). It is known for karaoke games to assess a player's singing by analysing the pitch of the player's voice input to the system unit 10 via the microphone 730. As such, this form of karaoke game will not be described in detail herein. However, there are songs that are not suited to such analysis, for example so-called “rap” songs, in which the performer, rather than singing the lyrics of the song, speaks the lyrics (albeit possibly rhythmically) to the accompanying backing track. As such, pitch analysis is not generally appropriate to this form of karaoke. The present embodiment therefore detects how well the player is speaking the lyrics. This will be described in more detail below.

It will be appreciated that it is possible for a karaoke game to have a mixture of rap and non-rap songs available, the karaoke game making use of pitch analysis for non-rap songs and speech analysis for rap songs. It will also be appreciated that these two types of analysis may be used in isolation (i.e. a player singing a non-rap song being assessed based on pitch analysis and not speech analysis; a player singing a rap song being assessed based on speech analysis and not pitch analysis) or in combination (i.e. a player singing a rap or non-rap song being assessed based on both pitch analysis and speech analysis).

FIG. 4 schematically illustrates the logical functionality of a PlayStation2 machine in respect of an embodiment of the invention. The functions of the blocks shown in FIG. 4 are, of course, carried out, mainly by execution of appropriate software, by parts of the PlayStation2 machine as shown in FIG. 1, the particular parts of the PlayStation2 machine concerned being listed below. The software could be provided from disk or ROM storage and/or via a transmission medium such as an internet connection.

To execute the above-mentioned karaoke game, control logic 800 initiates the replay of an audio backing track from a disk storage medium 810. The audio replay is handled by replay logic 820 and takes place through an amplifier 307 forming part of the television set 305, and the loudspeaker 310 also forming part of the television set 305.

The replay or generation of a video signal to accompany the audio track is also handled by the replay logic 820. Background images/video may be stored on the disk storage medium 810 or may instead be synthesised. Graphical overlays representing the lyrics to be sung/spoken are also generated in response to data from a song file 830, to be described below. The output video signal is displayed on the screen of the television set 305.

The microphone 730 is also connected to the replay logic 820. The replay logic 820 converts the digitised audio signal from the microphone back into an analogue signal and supplies it to the amplifier 307 so that a player can hear his own voice through the loudspeaker 310.

The song file 830 will now be described.

The song file 830 stores data defining the lyrics and notes which the user has to sing/speak to complete a current song, together with speech analysis information for performing the speech analysis. A part of an example of a song file is shown schematically in FIG. 5.

The song file 830 is expressed in XML format and starts with a measure of the song's tempo, expressed in beats-per-minute (not shown in FIG. 5). The next term is a measure of resolution (not shown in FIG. 5), i.e. what fraction of a beat is used in the note duration figures appearing in that song file 830. For example, if the resolution is “semiquaver” and the tempo is 96 beats per minute, then a note duration value of “1” corresponds to a quarter of a beat, or in other words 1/384 of a minute.
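The duration arithmetic above can be made concrete with a short sketch. The resolution names and their beat fractions below are illustrative assumptions, not the song file's actual vocabulary; the worked example reproduces the figure from the text (a semiquaver at 96 beats per minute lasts 1/384 of a minute, i.e. 0.15625 seconds).

```python
# A worked sketch of the tempo/resolution arithmetic described above.
# The resolution-name table is a hypothetical mapping for illustration.

RESOLUTION_BEATS = {
    "crotchet": 1.0,     # one beat
    "quaver": 0.5,       # half a beat
    "semiquaver": 0.25,  # quarter of a beat
}

def note_seconds(duration_value, tempo_bpm, resolution):
    """Convert a song-file duration value into seconds."""
    seconds_per_beat = 60.0 / tempo_bpm
    return duration_value * RESOLUTION_BEATS[resolution] * seconds_per_beat
```

With a duration value of 1 at 96 bpm and semiquaver resolution, this yields 60/384 = 0.15625 seconds, matching the text.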

A number of “NOTE” attributes follow, the “NOTE” attributes forming part of a “SENTENCE” attribute. A “NOTE” attribute comprises a “MidiNote” value, which represents a particular pitch at which the player is expected to sing, and a “Duration” value, which represents the duration of that note (which can be calculated in seconds by reference to the tempo and resolution information as described above). A “MidiNote” value of zero (i.e. no note) represents a pause for a particular duration (such as between words). In the midi scale, middle C is represented by midi-number 60. The note A above middle C, which has a standard frequency of 440 Hz, is represented by midi-number 69. Each octave is represented by a span of 12 in the midi scale, so (for example) top C (the C above middle C) is represented by midi-number 72. It should be noted that in some systems, the midi-number 0 is assigned to bottom C (about 8.175 Hz), but in the present embodiment midi-number 0 indicates a pause with no note expected.
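The midi-number facts above (A = 440 Hz at 69, twelve numbers per octave) follow the standard equal-temperament relation, which can be sketched directly. The function name is an assumption; the pause convention for midi-number 0 follows the embodiment rather than the systems that assign 0 to bottom C.

```python
# Standard midi-number-to-frequency relation implied by the text:
# midi 69 is A at 440 Hz, and each octave spans 12 midi numbers.

def midi_to_frequency(midi_note):
    """Return the pitch in Hz for a midi number, or None for a pause."""
    if midi_note == 0:  # embodiment's convention: 0 means a pause, no note
        return None
    return 440.0 * 2.0 ** ((midi_note - 69) / 12.0)
```

This gives about 261.63 Hz for middle C (midi 60) and about 523.25 Hz for top C (midi 72), one octave apart as the text describes.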

A “NOTE” attribute also comprises a “Rap” value which, when set to a value of “Yes”, indicates that the current note and associated lyric are to be assessed based on speech analysis rather than pitch analysis and, when set to a value of “No”, indicates that the current note and associated lyric are to be assessed based on pitch analysis rather than speech analysis. The song file part shown in FIG. 5 is for a rap song and, as such, the “Rap” values of the “NOTE” attributes are set to “Yes”. As such, the actual “MidiNote” value is not used when assessing the player's performance of this song (as rap songs are assessed using speech analysis rather than pitch analysis).

A “NOTE” attribute may also comprise a “FreeStyle” value, which, when set to a value of “Yes”, indicates that the song can be performed and analysed based on pitch analysis (rather than speech analysis) if, for example, the version of the game being played does not have functionality to handle rap songs (i.e. when the game version does not have the speech analysis functionality). This could occur, for example, when the disk storage medium 810 containing the song file 830 is used to supply data to a game version that is already being executed by the system unit 10, that game version being an earlier game version than the game version stored on the disk storage medium 810. Alternatively, the karaoke game may provide the user with the option to not analyse a rap song using speech analysis, but rather to use pitch analysis instead. In this case, setting the “FreeStyle” value to “Yes” indicates that this option is available for this song.

A “NOTE” attribute also comprises a “Lyric” value. This might be a part of a word, a whole word or even (in some circumstances) more than one word. It is possible for the word to be empty, for example (in XML) “Lyric=“ ””, which may be used when a pause is required (such as when the “MidiNote” value is set to 0).

A “NOTE” attribute may also comprise one or more “POINT” attributes. A “POINT” attribute specifies information that is used to perform the speech analysis for a particular “NOTE” attribute. More particularly, a “POINT” attribute specifies a target action (a characteristic of, or event in, the input audio from the microphone 730) that is to be tested for to determine whether or not the user has successfully completed that “NOTE” (or a part thereof).
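To make the structure concrete, the following sketch parses a hypothetical song-file fragment in the spirit of the description above. The element layout, attribute names and the “Voiced” point type in the sample are assumptions for illustration only, not the patent's actual schema (which is shown only in FIG. 5, not reproduced here).

```python
# Parse a hypothetical song-file fragment with NOTE and POINT attributes.
# The XML below is an illustrative assumption, not the actual schema.
import xml.etree.ElementTree as ET

SONG_XML = """
<SENTENCE>
  <NOTE MidiNote="60" Duration="4" Rap="Yes" Lyric="Never">
    <POINT Type="Voiced"/>
  </NOTE>
  <NOTE MidiNote="0" Duration="2" Rap="Yes" Lyric=""/>
</SENTENCE>
"""

def load_notes(xml_text):
    """Return (lyric, duration, rap?, point types) for each NOTE."""
    notes = []
    for note in ET.fromstring(xml_text).iter("NOTE"):
        points = [p.get("Type") for p in note.findall("POINT")]
        notes.append((note.get("Lyric"),
                      int(note.get("Duration")),
                      note.get("Rap") == "Yes",
                      points))
    return notes
```

The second NOTE in the sample shows the empty-lyric, MidiNote-zero pause convention described in the text.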

A “POINT” attribute also comprises a “Type” value that specifies the particular characteristic/event that is to be tested. The meaning of these point types, and the way in which they are detected, will now be described.

Overview

The type detection system begins with raw microphone samples and ends with per-note verdicts, which are used by the scoring system in the game. Here, ‘low-level processing’ refers to the sequence of steps from raw samples to a list of ‘speech events’. ‘Scoring’ refers to the steps which go from these speech events to the per-note verdicts as to what “type” of event took place during a note period. Fundamentally, the system works by recognising basic speech elements in the microphone data and attempting to correlate them with a set of expected features tagged up in the melody file.

Low-Level Processing

In the type detection system, a ‘frame’ is 128 consecutive raw microphone samples. This corresponds to 1/375 of a second. Typically, six or ten frames are processed during each television frame.

In the first instance, each frame is classified as one of ‘silence’, ‘fricative noise’ or ‘vocal noise’. The fricative/vocal decision is based on the roughness of the wave-form, and uses a simplified form of a so-called Hurst analysis, which is used to quantify the roughness of functional fractals.

Fricative noise has a relatively high-frequency, relatively wide-band content which generates a very rough waveform that behaves like a function with a high fractal dimension, an example being shown in FIG. 7B. Vocal noise, on the other hand, has a relatively low-frequency, relatively narrow-band spectrum that has a low fractal dimension, an example being shown in FIG. 7A.

Hurst analysis can be understood in the following way:

The process starts with a set of sampled time-series data. The average magnitude of the change in sample value between samples that are, say, ten samples apart can be calculated. If an equivalent value is then calculated for a sample separation of 20, it would generally be expected to deliver a higher value. In fact if the data represent a straight line of non-zero gradient, or even a smooth curve, the second value will generally be twice as big as the first.

On the other hand, for a maximally fractal (i.e. very rough) set of data, the two values would tend to be about the same. This is the case for white noise, which consists of consecutive independently random samples. In this case, the distance between two samples has no statistical influence over the likely variation in their values.

The degree to which the average change in sample value increases with increasing sample separation provides a well defined quantification of the roughness of the data. In the context of a microphone waveform, this approach is extremely fast compared to spectral techniques, because it avoids the use of FFTs.

A full Hurst calculation would calculate values for many sample separations. In the rap system of the present embodiment, it has been found to be sufficient to use just three: the 3- and 12-sample separations (v3 and v12), and the large-separation limiting value, v0, which is just the mean sample value in a frame. Put simply, v0, v12 and v3 essentially correspond to measures of volume tuned to low-, medium- and high-frequency bands. These values are formally defined by:

ν0 = Σi |s(i)|,  ν3 = Σi |s(i) − s(i−3)|,  ν12 = Σi |s(i) − s(i−12)|

where s(i) denotes the i-th sample in the frame.

The voice/fricative decision for the frame is then dependent on the ratios between the three volumes. Essentially, comparatively higher high-frequency volumes will lead to a fricative result, and comparatively higher low-frequency volumes will lead to a voiced result.

The criterion for deciding that a noisy frame is fricative rather than vocal, arrived at by empirical trials, is:

ν3/ν0 + ν3/ν12 > 1.75

A silence conclusion will occur if all three volumes are below thresholds which are determined dynamically on a relatively short timescale.
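The per-frame decision above can be sketched as follows. This is a minimal illustration, assuming a fixed silence threshold (the embodiment derives its thresholds dynamically on a short timescale) and adding guards for zero volumes that the text does not spell out; the function names are hypothetical.

```python
def frame_volumes(samples):
    """Compute the three volume measures for one frame of samples."""
    v0 = sum(abs(s) for s in samples)
    v3 = sum(abs(samples[i] - samples[i - 3]) for i in range(3, len(samples)))
    v12 = sum(abs(samples[i] - samples[i - 12]) for i in range(12, len(samples)))
    return v0, v3, v12

def classify_frame(samples, silence_threshold=50):
    """Return 'silence', 'fricative' or 'voice' for one 128-sample frame."""
    v0, v3, v12 = frame_volumes(samples)
    if v0 < silence_threshold and v3 < silence_threshold and v12 < silence_threshold:
        return "silence"
    if v3 == 0:
        return "voice"       # no high-frequency energy at all
    if v12 == 0:
        return "fricative"   # assumption: all energy above the 12-sample band
    # the empirically derived criterion from the text
    return "fricative" if v3 / v0 + v3 / v12 > 1.75 else "voice"
```

A constant (purely low-frequency) frame classifies as voice, a rapidly alternating frame as fricative, and an all-zero frame as silence.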

The skilled man will of course understand that these outcomes are statistically generated based on a rapid analysis of a sample of the audio signal. They may or may not represent the true situation, but have been found in empirical tests to give a useful indication for the purposes of scoring a karaoke-type game.

Special Case for Low Frequency Voices

There is a special case which has to be considered for low frequency voices. Since the duration of a frame (1/375 second) is substantially shorter than the period of the waveform for a low audio frequency, it is possible for the lowest frequency values (ν0) to exhibit a periodic fluctuation (e.g. as a result of aliasing effects), where perhaps one or two out of every three frames records an artificially low value. In the absence of a correction, this could cause a rapid alternation between fricative and vocal conclusions.

A correction can be made by increasing the value of ν0 for each frame to the larger of its two immediate neighbours. As an example, consider the sequence of ν0 values 551551551551, where each successive single-digit value represents a level of ν0 in a corresponding frame. This example sequence might be detected for a voiced input at about 120 Hz. The sequence is automatically corrected, using the above rule, to 555555555555. This results, correctly, in a stream of consecutive voice results.
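The neighbour-based correction follows directly from the rule above; the helper name here is illustrative, and boundary frames (which have only one neighbour) are handled by an assumption not stated in the text:

```python
def correct_v0(values):
    """Raise each frame's v0 to the larger of its two immediate neighbours,
    suppressing the periodic dips that low-frequency voices can produce."""
    out = []
    for i, v in enumerate(values):
        left = values[i - 1] if i > 0 else v          # edge frames keep their
        right = values[i + 1] if i + 1 < len(values) else v  # own value as a neighbour
        out.append(max(v, left, right))
    return out
```

Applied to the 551551551551 example, every dip of 1 is raised to its neighbouring 5, giving the corrected 555555555555 sequence.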

Speech Events

In the next level of processing, sequences of consecutive similar results are grouped into a list of intervals, or speech events. Each interval is described by a start time, a stop time and a categorisation of silence, fricative or voice. This process is shown schematically in FIG. 8, in which frames of 128 samples (top row) are classified as Fricative (F), Silence (S) or Voice (V) (second row), and these classifications are grouped into speech events (bottom row).

The criteria for switching from one state to another are as follows:

From fricative: to voice, 3 consecutive voice frames; to silence, 4 consecutive silent frames.

From voice: to fricative, 2 consecutive fricative frames; to silence, 4 consecutive silent frames.

From silence: to fricative, 2 consecutive fricative frames; to voice, 5 voiced frames out of 7, and a minimum cumulative volume.
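A simplified sketch of this state machine follows. It applies only the consecutive-frame counts, approximating the "5 voiced frames out of 7, and a minimum cumulative volume" rule for the silence-to-voice transition as five consecutive voiced frames; the event boundaries and function name are illustrative.

```python
STATE_OF = {"F": "fricative", "V": "voice", "S": "silence"}

def group_frames(labels, start_state="silence"):
    """Group per-frame labels ('F'/'V'/'S') into (start, end, category) events."""
    thresholds = {
        ("fricative", "voice"): 3, ("fricative", "silence"): 4,
        ("voice", "fricative"): 2, ("voice", "silence"): 4,
        ("silence", "fricative"): 2, ("silence", "voice"): 5,  # simplified rule
    }
    state, start = start_state, 0
    run_cat, run_len, run_start = None, 0, 0
    events = []
    for i, lab in enumerate(labels):
        cat = STATE_OF[lab]
        if cat == state:            # frame agrees with the current state
            run_cat, run_len = None, 0
            continue
        if cat == run_cat:          # an existing run of differing frames continues
            run_len += 1
        else:                       # a new run of differing frames begins here
            run_cat, run_len, run_start = cat, 1, i
        if run_len >= thresholds[(state, cat)]:
            events.append((start, run_start, state))  # close the old event
            state, start = cat, run_start
            run_cat, run_len = None, 0
    events.append((start, len(labels), state))
    return events
```

For example, a silence-fricative-voice-silence frame stream produces four events whose boundaries fall where each qualifying run began.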

In addition, there are two situations where multiple events can occur simultaneously.

Two noisy events within the same category will be combined into one if they are separated by a very short silent interval. In this case, the two intervals are added to the feature list individually, and as a single combined interval. The separation intervals are 0.06 s for voiced intervals, and 0.035 s for fricative intervals.
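The short-gap combining rule can be sketched as follows; the event representation and function name are illustrative assumptions:

```python
MAX_GAP = {"voice": 0.06, "fricative": 0.035}  # separation intervals from the text

def add_with_merges(noisy_events):
    """noisy_events: time-ordered (start_s, stop_s, category) tuples.
    Returns the feature list: every individual event, plus a combined
    interval for each same-category pair separated by a very short silence."""
    features = list(noisy_events)
    for (s1, e1, c1), (s2, e2, c2) in zip(noisy_events, noisy_events[1:]):
        if c1 == c2 and (s2 - e1) < MAX_GAP[c1]:
            features.append((s1, e2, c1))  # combined interval added alongside the pair
    return features
```

Note that, as the text requires, the two individual intervals remain on the list even when their combination is added.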

The ν0 profile during a voiced interval is continuously monitored. A very narrow trough in the volume vs time relationship during the note is interpreted as an ‘almost silence’. The interval is then split into two voiced intervals. The unbroken interval is also added to the feature list.

Type Detection and Scoring

The rap scoring works by comparing the list of intervals generated from the microphone data to a list of expected events described in the melody file. There are nine different types of events that the scoring system can detect:

“Fric”: Extended fricative sound, e.g. “FFF” or “SSS”.

“Short Fric”: Plosive fricative sound, e.g. “k” or “t”.

“Fric End”: End of a fricative sound.

“Voice”: A voiced period, e.g. “ah” in “Fast”.

“Voice Begin”: Commencement of a voiced syllable.

“Voice End”: End of a voiced period.

“Voice Inside”: Fully voiced interval.

“Stop”: A brief period of silence corresponding to the closing of the airway, usually preceding a plosive sound, e.g. between the “S” and the “t” in “stop”.

“Glottal”: A scoring type which will accept either a short fricative or a stop.

“Start”: Not a scoring element. “Start” points mark out sections of the song to be considered in isolation.

The replay logic 820 uses the song file 830 to determine what to display on the screen of the television 305 to prompt the player how to play the karaoke game. This may include (i) the “Lyric” value so that the player knows what words to sing/speak and (ii) “POINT” attributes so that the player knows what target actions are being assessed.

In the present example, the song file 830 does not define the backing track which the player will hear and which in fact will help prompt the player to sing the song. The backing track is recorded separately, for example as a conventional audio recording. This arrangement means that the backing track replay and the reading of data from the song file 830 need to start at related times (e.g. substantially simultaneously), something which is handled by the control logic 800. However, in other embodiments the song file 830 could define the backing track as well, for example as a series of midi-notes to be synthesised into sounds by a midi synthesiser (not shown in FIG. 4, but actually embodied within the SPU 300 of FIG. 1).

Returning to FIG. 4, a note clock generator 840 reads the tempo and resolution values from the song file 830 and provides a clock signal at the appropriate rate. In particular, the rate is the tempo multiplied by the sub-division of each beat. So, for example, for a tempo of 96 and a resolution of “semiquaver” (quarter-beat) the note clock 840 runs at 96×4 beats-per-minute, i.e. 384 beats-per-minute. If the tempo were 90 and the resolution were “quaver” (half-beat) then the note clock 840 would run at 180 (90×2) beats-per-minute, and so on.
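The note clock rate amounts to a single multiplication. In this sketch the quarter-beat and half-beat factors come from the examples above; the whole-beat entry and the function name are assumptions added by analogy.

```python
# Sub-division factor per beat for each resolution value.
SUBDIVISION = {"crotchet": 1, "quaver": 2, "semiquaver": 4}

def note_clock_rate(tempo_bpm, resolution):
    """Clock rate in beats-per-minute: tempo times the sub-division of each beat."""
    return tempo_bpm * SUBDIVISION[resolution]
```

This reproduces the two worked examples: a tempo of 96 at semiquaver resolution gives 384, and 90 at quaver resolution gives 180.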

The note clock 840 is used to initiate reading out from the song file 830 of the various attributes of the song and also to control the detection of audio input (the player's actions) from the microphone 730 by a detector 850.

With regard to speech analysis, the signal from the USB microphone 730 is also supplied to the detector 850. This operates as described above. The detector 850 receives a difficulty level value as set by the player of the karaoke game. This difficulty level may be used to alter how the detection of characteristics/events occurs. For example, when comparing input frequency domain data with known frequency domain characteristics, the detector alters the degree to which the input frequency domain data must match the known frequency domain characteristics in order for a detection of, say, a fricative sound, to occur. Alternatively, the threshold levels for starting/ending/maintaining speech input can be varied to make the detection of speech/silence easier or more difficult to achieve. Finally, the period in which the characteristic/event is to occur may be altered to make detection easier or more difficult (such as extending a period to make detection easier/more likely).

Although full speech recognition could be used to determine what actual words are being spoken by the player, this is not actually necessary. Indeed, due to the additional processing overhead that this could cause and the possible additional latency, such full speech recognition is less preferable than the speech analysis described above.

The detection result (e.g. silence, speech beginning, fricative sound, etc.) is supplied to a buffer register 860 and from there to a comparator 870. The corresponding target action (desired characteristic/event) from the song file 830 is also read out, under the control of the note clock 840, and is supplied to another buffer register 880 before being passed to the comparator 870.

The comparator 870 is arranged to compare the target action from the song file with the/any detected characteristic/event occurring in the input audio signal (i.e. the player's action) during the period specified in the song file 830.

For scoring purposes, the comparator 870 detects whether the player has performed the target action (i.e. whether the input audio signal contains the desired characteristic/event during the interval specified by the song file 830). This is achieved by a comparison of the data stored in the buffer registers 860, 880. This result is passed to scoring logic 890.

The scoring logic 890 receives the results of the comparison by the comparator 870 and generates a player's score from them. At the beginning of a song, the scoring logic is set to a non-scoring mode. In the non-scoring mode, the scoring logic 890 detects whether the results of the comparisons by the comparator 870 conform to a first pattern of successfully and unsuccessfully completed target actions. For example, this first pattern could be nine successfully completed target actions with at most three unsuccessfully completed target actions occurring amongst the nine successfully completed target actions. If the scoring logic 890 detects the first pattern, it enters a scoring mode (leaving the non-scoring mode). Whilst it will be appreciated that the first pattern could be any pattern of successfully and unsuccessfully completed target actions, in the current embodiment, the first pattern is a sequence of successive target actions that have been successfully completed, the sequence having a predetermined length. As such, the scoring logic maintains a count, in a register 892, of the number of successive target actions (events/characteristics specified by the song file 830) that are successfully completed by the player (as detected by the detector 850 and the comparator 870). When this count exceeds a threshold number, the scoring logic enters a scoring mode.

When in the scoring mode, the scoring logic 890 detects whether the results of the comparisons by the comparator 870 conform to a second pattern of successfully and unsuccessfully completed target actions. For example, this second pattern could be four unsuccessfully completed target actions with at most one successfully completed target action occurring amongst the four unsuccessfully completed target actions. If the scoring logic 890 detects the second pattern, it enters the non-scoring mode (leaving the scoring mode). Whilst it will be appreciated that the second pattern could be any pattern of successfully and unsuccessfully completed target actions, in the current embodiment, the second pattern is a sequence of successive target actions that have not been successfully completed, the sequence having a predetermined length. As such, the scoring logic 890 maintains a count, in a register 894, of the number of successive target actions (events/characteristics specified by the song file 830) that are not successfully completed by the player (as detected by the detector 850 and the comparator 870). When this count exceeds a threshold number, the scoring logic returns to the non-scoring mode.

The player only accumulates a score whilst the scoring logic is in the scoring mode.

FIG. 6 schematically illustrates the operation of the scoring logic 890.

At a step S100, the karaoke game begins for a song in the rap-mode. This corresponds to a song for which the song file 830 has the “Rap” value set to “Yes” in its “NOTE” attributes.

At a step S102, the values of the registers 892 and 894, indicating respectively the number of successive target actions that have been successfully completed and the number of successive target actions that have not been successfully completed, are reset to 0 and the scoring logic 890 is set to the non-scoring mode.

At a step S104, the detector 850 detects a characteristic/event in the input audio at the time specified in the song file 830 for the next target action. The comparator 870 then determines whether the detected characteristic/event corresponds to that next target action, i.e. whether that next target action has been successfully completed by the player. The scoring logic 890 is informed of this determination. It should be noted that throughout a song, the system maintains a list of all the scoring events that are expected to occur within about one second of the current time. Each event generated by the microphone is compared to all the scoring points in the current list. The scoring point is as successful as the most closely matched event.

Each scoring point is attached to a note. Each note can have zero, one or more scoring points attached to it. In addition to evaluating these scoring points, the system also monitors whether or not any noise was detected at the microphone during the note.

For a successful verdict on a note, there are two requirements: (a) some noise must have been detected during the note, and (b) (as described below with reference to step S112) the note must be inside a run of several successful scoring points.

If, at a step S106, the comparator has determined that the next target action has not been successfully completed, then processing returns to the step S102. However, if it is determined that the next target action has been successfully completed, then processing continues at a step S108 at which the value in the register 892 is incremented.

If, at a step S110, it is determined that the number of successive target actions that have been successfully completed (as recorded in the register 892) does not exceed a success threshold value, then processing returns to the step S104. However, if it is determined that the number of successive target actions that have been successfully completed (as recorded in the register 892) does exceed the success threshold value, then processing continues at a step S112 at which the values in the registers 892 and 894 are reset to 0 and the scoring logic 890 enters a scoring mode.

At a step S114, the detector 850 detects a characteristic/event in the input audio at the time specified in the song file 830 for the next target action. The comparator 870 then determines whether the detected characteristic/event corresponds to that next target action, i.e. whether that next target action has been successfully completed by the player.

The scoring logic 890 is informed of this determination. This is the same as at the step S104.

If, at a step S116, the comparator has determined that the next target action has been successfully completed, then processing continues at a step S118 at which the player's score is incremented (as the scoring logic 890 is in the scoring mode and the player has successfully completed a target action). Processing then continues at a step S120 at which the value of the register 894 is set to 0. Processing returns to the step S114.

However, if it is determined at the step S116 that the next target action has not been successfully completed, then processing continues at a step S122 at which the value in the register 894 is incremented.

If, at a step S124, it is determined that the number of successive target actions that have not been successfully completed (as recorded in the register 894) does not exceed a failure threshold value, then processing returns to the step S114. However, if it is determined that the number of successive target actions that have not been successfully completed (as recorded in the register 894) does exceed the failure threshold value, then processing returns to the step S102, thereby leaving the scoring mode and returning to the non-scoring mode.
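The flow of steps S100 to S124 can be condensed into the following sketch. The counter names mirror the registers 892 and 894, the default thresholds are illustrative only, and "exceed" is read as strictly greater than; the per-note noise requirement described above is omitted for brevity.

```python
def score_run(results, success_threshold=9, failure_threshold=4):
    """results: sequence of booleans, one per target action (True = completed).
    Returns the player's score under the scoring/non-scoring mode rules."""
    score = 0
    scoring_mode = False
    successes = 0   # register 892: consecutive successes while non-scoring
    failures = 0    # register 894: consecutive failures while scoring
    for ok in results:
        if not scoring_mode:
            if ok:
                successes += 1                        # step S108
                if successes > success_threshold:     # step S110
                    scoring_mode = True               # step S112
                    successes = failures = 0
            else:                                     # step S106: back to S102
                successes = 0
        else:
            if ok:                                    # step S116
                score += 1                            # step S118
                failures = 0                          # step S120
            else:
                failures += 1                         # step S122
                if failures > failure_threshold:      # step S124: back to S102
                    scoring_mode = False
                    successes = failures = 0
    return score
```

For instance, with the default thresholds a run of twelve consecutive successes yields a score of 2: the first ten flip the logic into the scoring mode, and only the remaining two accumulate points.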

The success and failure threshold values used at the steps S110 and S124 respectively may be different values and, in preferred embodiments, the success threshold value used at the step S110 is larger than the failure threshold value used at the step S124. Furthermore, in preferred embodiments, the scoring logic adjusts the success and failure thresholds in dependence upon a difficulty level set by the player. In particular, the more difficult the level, the greater the success threshold and/or the lower the failure threshold.

It will be appreciated that the registers 892, 894 (and possibly extra registers as required) may be used by the scoring logic 890 to detect other patterns of successfully and unsuccessfully completed target actions, these registers storing a history of the pattern of target actions that have been successfully and unsuccessfully completed by the player's actions.

FIG. 4 schematically illustrates the operation of the embodiment as a set of logical blocks, for clarity of the description. Of course, although the blocks could be implemented in hardware, or in semi-programmable hardware (e.g. field programmable gate array(s)), these blocks may conveniently be implemented by parts of the PlayStation2 machine system schematically illustrated in FIG. 1 under the control of suitable software. One example of how this may be achieved is as follows:

Control logic 800, note clock generator 840, detector 850, registers 860, 880, 892 and 894, comparator 870, and scoring logic 890: Emotion Engine 100, accessing the HDD 390, ROM 400, RDRAM 500 etc.

Replay logic 820: Emotion Engine 100, accessing the DVD/CD interface 450, the SPU 300 (for audio output), the IOP 700 (for microphone input) and the GS 200 (for video output).

In so far as the embodiments of the invention described above are implemented, at least in part, using software-controlled data processing apparatus, it will be appreciated that a computer program providing such software control and a storage medium by which such a computer program is stored are envisaged as aspects of the present invention.

Although illustrative embodiments of the invention have been described in detail herein with reference to the accompanying drawings, it is to be understood that the invention is not limited to those precise embodiments, and that various changes and modifications can be effected therein by one skilled in the art without departing from the scope and spirit of the invention as defined by the appended claims.

Claims

  1. Game processing apparatus comprising: an indication generator operable to generate an indication of a sequence of target actions to be executed by a user; a detector operable to detect user actions executed by said user; comparison logic operable to compare said target actions with said user actions to determine whether said target actions have been successfully completed by said user actions; and scoring logic operable in a scoring mode and a non-scoring mode, said scoring logic being operable: (a) in said non-scoring mode, to count a first pattern comprising a first plurality of successive target actions that have been successfully completed, as determined by said comparison logic, said scoring logic entering a scoring mode upon detection of said first pattern, said count being reset to zero by a target action not successfully completed; and (b) in said scoring mode, to count a second pattern comprising a second plurality of successive target actions that have not been successfully completed, as determined by said comparison logic, said count being reset to zero by a target action successfully completed, said scoring logic entering said non-scoring mode upon detection of said second pattern; in which said scoring logic is operable to generate a score for said user in respect of each user action representing successful completion of a target action, only when said scoring logic is in said scoring mode, said score being dependent upon said determination by said comparison logic.
  2. Apparatus according to claim 1, in which said first pattern comprises a first predetermined threshold number of successive target actions that have been successfully completed and said second pattern comprises a second predetermined threshold number of successive target actions that have not been successfully completed.
  3. Apparatus according to claim 2, in which said first predetermined threshold is greater than said second predetermined threshold.
  4. Apparatus according to claim 1, in which: said target actions comprise said user making and/or not making target sounds; said user actions comprise said user making and/or not making sounds; and said detector comprises a microphone.
  5. Apparatus according to claim 4, in which said indication comprises an indication of one or more words to be spoken by said user, said words corresponding to said target sounds.
  6. Apparatus according to claim 4, in which one or more of said target sounds comprises a fricative sound and/or a glottal sound.
  7. Apparatus according to claim 1, in which said target actions have associated target time periods of execution, said comparison logic determining whether one of said target actions has been successfully completed by one of said user actions in dependence upon whether said one of said user actions is detected within said target time period associated with said one of said target actions.
  8. Apparatus according to claim 7, operable to vary said predetermined time period in dependence upon a game level.
  9. Apparatus according to claim 2, operable to vary said first predetermined threshold or said second predetermined threshold in dependence upon a game level.
  10. Apparatus according to claim 1, comprising: a reader operable to read data stored on a storage medium when said storage medium is associated with said reader, said data defining said target actions.
  11. A method of game processing comprising: providing a processor, said processor coupled with a storage device, the storage device storing instructions configured to direct the processor to perform steps of: an indication generating step to generate an indication of a sequence of target actions to be executed by a user; a detection step to detect user actions executed by said user; a comparison step to compare said target actions with said user actions to determine whether said target actions have been successfully completed by said user actions; and a scoring step including scoring logic operable in a scoring mode and a non-scoring mode, said scoring logic being operable: (a) in said non-scoring mode, to count a first pattern comprising a first plurality of successive target actions that have been successfully completed as determined by said comparison step, and to enter a scoring mode upon detection of said first pattern, said count being reset to zero by a target action not successfully completed; and (b) in said scoring mode, to count a second pattern comprising a second plurality of successive target actions that have not been successfully completed as determined by said comparison step, said count being reset to zero by a target action successfully completed, and to enter said non-scoring mode upon detection of said second pattern; in which said scoring step generates a score for said user in respect of each user action representing successful completion of a target action, only when said scoring step is in said scoring mode, said score being dependent upon said determination by said comparison step.
