*In which you learn how to make objects farther away from the viewer appear smaller.*

In order for a rendered scene to feel real, you must simulate how light passes through the scene and into the human eye. One important quality of reality is that objects farther from the eye appear smaller than those near. Orthographic projections do not scale objects according to their depth. For that, you need a perspective projection.

Some ancient thinkers hypothesized that sight is a force that the eye emits into its environment, and the sight rays would reel in color from the surfaces they hit. Blinking made the world go dark.

These days, scientists understand that photons are emitted from light sources and hit the surfaces of the world. Photons are absorbed by the surfaces, reflected from the surfaces, or refracted through the surfaces. Which of these occurs depends on the frequency of the light and the material properties of the surfaces.

Some of the non-absorbed light bounces into your eye, passing through the lens at its front. Real lenses have apertures that narrow and widen to change how much light enters. In computer graphics, you assume the lens of the simulated eye is very small, the size of a pinhole. The photons land on nerve cells at the back of the eye. This collection of cells is called the retina in a real human eye and an image plane in computer graphics.

You can figure out where a photon from an object will land on the retina by drawing a line from the lens to the object. Photons from below the eye land at the top of the retina, while photons from above land at the bottom. This means that the retina receives a picture of the world that is upside down.

This flipping of the world is not a phenomenon you need to mimic in your renderers. In computer graphics, the image plane is moved from behind the lens to in front of the lens. When light projects onto an image plane in front of the viewer, it will be right-side up.

The line from a farther object to the lens has a smaller slope than the line from a closer object of the same size, so the farther object projects to a smaller area of the image plane. That is perspective.

To simulate how light projects this flattened perspective view of the world on the retina, you will need to do some math.

The orthographic projection does not try to simulate human vision. In particular, it removes the concept of a lens that funnels light through a pinhole. Photons instead travel in rays perpendicular to the image plane. The viewing volume of an orthographic projection is an extrusion of the image plane along these perpendicular rays, which produces a box.

In a perspective projection, light does pass through a lens, which means only a subset of the rays land on the image plane. The chunk of the world that is seen is not a box, but rather a pyramid. Many graphics libraries expect you to size this pyramid through the following four parameters:

- The vertical field of view. How many degrees tall is the pyramid?
- The aspect ratio. What is the pyramid's width:height ratio?
- The near distance. At what distance does the eye start perceiving?
- The far distance. At what distance does the eye stop perceiving?

The pyramid is artificially truncated by the near and far distances. A truncated pyramid is called a frustum. Explore how the viewing frustum of a perspective projection is shaped by these four parameters:

In your own renderers, you must figure out how to turn this frustum into the unit cube that WebGL expects. Mapping the orthographic projection's rectangular prism to a unit cube requires just a translation and a scaling. Mapping a pyramid is a lot more work.

You must first determine the y-coordinate of the near face's top edge. Consider this side profile of eye space:

See the smaller right triangle in the top half of the frustum? Since you know the angle of this triangle at the eye (\(\frac{\mathrm{fov}_y}{2}\)) and the length of the adjacent side (\(\mathrm{near}\)), you can use a little trigonometry to determine the top y-coordinate:

$$ \begin{aligned} \frac{\mathrm{top}}{\mathrm{near}} &= \tan \frac{\mathrm{fov}_y}{2} \\ \mathrm{top} &= \tan \frac{\mathrm{fov}_y}{2} \times \mathrm{near} \end{aligned} $$

How much of the world can you see to the right? You know the frustum's aspect ratio, which relates its width and height:

$$ \begin{aligned} \frac{\mathrm{right}}{\mathrm{top}} &= \mathrm{aspect\ ratio} \\ \mathrm{right} &= \mathrm{aspect\ ratio} \times \mathrm{top} \\ \end{aligned} $$

The parameters \(\mathrm{near}\), \(\mathrm{far}\), \(\mathrm{top}\), and \(\mathrm{right}\) are what you'll use to build a perspective projection matrix.
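In code, the two derived extents might be computed like this. This is a sketch; the function name, its parameter order, and the degrees-to-radians conversion are assumptions, not part of the text:

```javascript
// Derive the top and right extents of the near face from the
// frustum parameters. fovY is the vertical field of view in
// degrees; aspect is the width:height ratio.
function frustumExtents(fovY, aspect, near) {
  const halfFovRadians = fovY / 2 * Math.PI / 180;
  const top = Math.tan(halfFovRadians) * near;
  const right = aspect * top;
  return {top, right};
}
```

For example, a 90-degree vertical field of view with an aspect ratio of 2 and a near distance of 1 yields a top of 1 and a right of 2, since \(\tan 45^\circ = 1\).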

With your frustum defined, you are ready to figure out where on the image plane each vertex projects. Assume that the vertex's eye space position is \(\mathbf{p}_\mathrm{eye}\). You want the projected position \(\mathbf{p}_\mathrm{plane}\):

You know the z-coordinate of the projected position:

$$ z_\mathrm{plane} = -\mathrm{near} $$

The y-coordinate you figure out by setting up similar triangles. Since the triangles are similar, the ratio of their side lengths must match:

$$ \begin{aligned} \frac{y_\mathrm{plane}}{\mathrm{near}} &= \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\ y_\mathrm{plane} &= \mathrm{near} \times \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\ \end{aligned} $$

The x-component is computed similarly:

$$ \begin{aligned} x_\mathrm{plane} &= \mathrm{near} \times \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\ \end{aligned} $$

The \(z_\mathrm{eye}\) term is negated because the eye looks down the negative z-axis, which makes \(z_\mathrm{eye}\) negative; negating it gives the positive distance of the vertex from the eye.

The image plane must be mapped to the unit cube of normalized space. You want coordinates at the top of the plane to map to 1. You want coordinates at the right of the frustum to map to 1. To normalize your coordinates, you divide your plane position by the \(\mathrm{top}\) and \(\mathrm{right}\) values you derived earlier:

$$ \begin{aligned} x_\mathrm{norm} &= \frac{\mathrm{near}}{\mathrm{right}} \times \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\ y_\mathrm{norm} &= \frac{\mathrm{near}}{\mathrm{top}} \times \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\ \end{aligned} $$
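A sketch of this projection and normalization for a single eye-space point, as plain JavaScript (the function and parameter names are illustrative, not from the text):

```javascript
// Project an eye space position onto the image plane and scale
// it so that the top edge maps to 1 and the right edge maps to 1.
// Only x and y are handled here; z comes later.
function projectToNormalized(eye, near, top, right) {
  const xNorm = near / right * (eye.x / -eye.z);
  const yNorm = near / top * (eye.y / -eye.z);
  return {x: xNorm, y: yNorm};
}
```

With near, top, and right all 1, the point (2, 2, -2) lands exactly at (1, 1), the top-right corner of the image plane.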

Ignore the z-component for the moment because it's messy.

The transformation pipeline is built around matrices. You want to build a matrix that transforms your eye space coordinates into normalized coordinates. In particular, you want this to happen:

$$ \begin{bmatrix} ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \\ \end{bmatrix} \times \begin{bmatrix} x_\mathrm{eye} \\ y_\mathrm{eye} \\ z_\mathrm{eye} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} \times \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\ \frac{\mathrm{near}}{\mathrm{top}} \times \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\ ? \\ 1 \end{bmatrix} $$

What row when dotted with the eye space position will produce \(x_\mathrm{norm}\)? None. It's not possible to bring both \(x_\mathrm{eye}\) and \(z_\mathrm{eye}\) into the same term with a dot product. What will you do? Once again, it appears that the matrix system is broken.

Never fear. The GPU designers snuck in a hack. They decided that instead of targeting normalized space directly, you will target an intermediate space called clip space. In clip space, the coordinates have not been divided by \(-z_\mathrm{eye}\). After you emit a position in this undivided clip space, the GPU will divide all components of the position by the value that appears in the position's homogeneous coordinate. Since your normalized coordinates have \(-z_\mathrm{eye}\) in their denominator, that's the value you want as your homogeneous coordinate.

This then is the transformation whose matrix you are trying to build:

$$ \begin{bmatrix} ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \\ ? & ? & ? & ? \\ \end{bmatrix} \times \begin{bmatrix} x_\mathrm{eye} \\ y_\mathrm{eye} \\ z_\mathrm{eye} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} \times x_\mathrm{eye} \\ \frac{\mathrm{near}}{\mathrm{top}} \times y_\mathrm{eye} \\ ? \\ -z_\mathrm{eye} \end{bmatrix} $$

The division by the homogeneous coordinate is called the perspective divide. That divide lands you at the normalized coordinates you want:

$$ \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} \times x_\mathrm{eye} \\ \frac{\mathrm{near}}{\mathrm{top}} \times y_\mathrm{eye} \\ ? \\ -z_\mathrm{eye} \end{bmatrix} \div -z_\mathrm{eye} = \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} \times \frac{x_\mathrm{eye}}{-z_\mathrm{eye}} \\ \frac{\mathrm{near}}{\mathrm{top}} \times \frac{y_\mathrm{eye}}{-z_\mathrm{eye}} \\ ? \\ 1 \end{bmatrix} $$

The intermediate space right before the perspective divide is called clip space because in that space the GPU performs clipping. Any geometry that lies outside the viewing frustum is clipped out and not processed by the fragment shader.
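The divide itself is mechanical. A sketch of what the GPU does to each clip-space position after clipping (the function name is illustrative):

```javascript
// Divide a clip space position's components by its homogeneous
// coordinate, as the GPU does automatically after clipping.
function perspectiveDivide(clip) {
  return {
    x: clip.x / clip.w,
    y: clip.y / clip.w,
    z: clip.z / clip.w,
  };
}
```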

The perspective divide frees you up to deduce a few rows of the perspective matrix. The x- and y-components are scaled, and the bottom row selects out and negates the z-component to form the correct homogeneous coordinate:

$$ \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} & 0 & 0 & 0 \\ 0 & \frac{\mathrm{near}}{\mathrm{top}} & 0 & 0 \\ ? & ? & ? & ? \\ 0 & 0 & -1 & 0 \\ \end{bmatrix} \times \begin{bmatrix} x_\mathrm{eye} \\ y_\mathrm{eye} \\ z_\mathrm{eye} \\ 1 \end{bmatrix} = \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} \times x_\mathrm{eye} \\ \frac{\mathrm{near}}{\mathrm{top}} \times y_\mathrm{eye} \\ ? \\ -z_\mathrm{eye} \end{bmatrix} $$

You have constructed 75% of the matrix. All that's left is the last 25%.

The third row of the perspective matrix still has not been determined. You know that this dot product operation is going to happen to compute \(z_\mathrm{clip}\):

$$ \begin{bmatrix} ? & ? & ? & ? \end{bmatrix} \cdot \begin{bmatrix} x_\mathrm{eye} & y_\mathrm{eye} & z_\mathrm{eye} & 1 \end{bmatrix} = z_\mathrm{clip} $$

You must reason out what the unknowns should be. A position's \(z_\mathrm{clip}\) does not depend on the eye space position's x- or y-components, so you fill in a couple of zeroes:

$$ \begin{bmatrix} 0 & 0 & ? & ? \end{bmatrix} \cdot \begin{bmatrix} x_\mathrm{eye} & y_\mathrm{eye} & z_\mathrm{eye} & 1 \end{bmatrix} = z_\mathrm{clip} $$

What the other two unknowns should be is less clear. Name them so you can do some algebra:

$$ \begin{bmatrix} 0 & 0 & a & b \end{bmatrix} \cdot \begin{bmatrix} x_\mathrm{eye} & y_\mathrm{eye} & z_\mathrm{eye} & 1 \end{bmatrix} = z_\mathrm{clip} $$

Expand the dot product to simplify:

$$ a \times z_\mathrm{eye} + b = z_\mathrm{clip} $$

Apply the perspective divide to these terms to land in normalized space:

$$ \frac{a \times z_\mathrm{eye} + b}{-z_\mathrm{eye}} = \frac{z_\mathrm{clip}}{-z_\mathrm{eye}} = z_\mathrm{norm} $$

Your two unknowns are still unknown. However, you have two mathematical truths that will help you resolve them: because you are mapping to the unit cube, you know what \(z_\mathrm{norm}\) should be at \(z_\mathrm{eye} = -\mathrm{near}\) and at \(z_\mathrm{eye} = -\mathrm{far}\):

$$ \begin{aligned} \frac{a \times -\mathrm{near} + b}{\mathrm{near}} &= -1 \\ \frac{a \times -\mathrm{far} + b}{\mathrm{far}} &= 1 \\ \end{aligned} $$

Two equations with two unknowns form a linear system that you can solve. Solve the first equation for \(b\):

$$ \begin{aligned} \frac{a \times -\mathrm{near} + b}{\mathrm{near}} &= -1 \\ a \times -\mathrm{near} + b &= -\mathrm{near} \\ b &= -\mathrm{near} - a \times -\mathrm{near} \\ &= a \times \mathrm{near} - \mathrm{near} \\ \end{aligned} $$

Substitute this expression for \(b\) in your second equation and solve for \(a\):

$$ \begin{aligned} \frac{a \times -\mathrm{far} + b}{\mathrm{far}} &= 1 \\ \frac{a \times -\mathrm{far} + a \times \mathrm{near} - \mathrm{near}}{\mathrm{far}} &= 1 \\ a \times -\mathrm{far} + a \times \mathrm{near} - \mathrm{near} &= \mathrm{far} \\ a \times -\mathrm{far} + a \times \mathrm{near} &= \mathrm{near} + \mathrm{far} \\ a(\mathrm{near} - \mathrm{far}) &= \mathrm{near} + \mathrm{far} \\ a &= \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ \end{aligned} $$

Substitute this expression for \(a\) back into the equation for \(b\) and simplify:

$$ \begin{aligned} b &= a \times \mathrm{near} - \mathrm{near} \\ &= \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \times \mathrm{near} - \mathrm{near} \\ &= \mathrm{near} \times \left(\frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} - 1\right) \\ &= \mathrm{near} \times \left(\frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} - \frac{\mathrm{near} - \mathrm{far}}{\mathrm{near} - \mathrm{far}}\right)\\ &= \mathrm{near} \times \frac{\mathrm{near} + \mathrm{far} - \mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ &= \mathrm{near} \times \frac{2 \times \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ &= \frac{2 \times \mathrm{near} \times \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ \end{aligned} $$
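You can sanity-check these expressions numerically. With, say, a near of 1 and a far of 10, \(z_\mathrm{norm}\) should come out to \(-1\) at \(z_\mathrm{eye} = -\mathrm{near}\) and \(1\) at \(z_\mathrm{eye} = -\mathrm{far}\) (the variable names here are illustrative):

```javascript
const near = 1;
const far = 10;

// The two matrix entries just derived.
const a = (near + far) / (near - far);
const b = 2 * near * far / (near - far);

// z_norm = (a * zEye + b) / -zEye, evaluated at the two planes.
const zNormAtNear = (a * -near + b) / near;  // should be -1
const zNormAtFar = (a * -far + b) / far;     // should be 1
```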

Whew. That algebra fills the last two holes in your matrix. Altogether, your perspective transformation looks like this:

$$ \begin{bmatrix} \frac{\mathrm{near}}{\mathrm{right}} & 0 & 0 & 0 \\ 0 & \frac{\mathrm{near}}{\mathrm{top}} & 0 & 0 \\ 0 & 0 & \frac{\mathrm{near} + \mathrm{far}}{\mathrm{near} - \mathrm{far}} & \frac{2 \times \mathrm{near} \times \mathrm{far}}{\mathrm{near} - \mathrm{far}} \\ 0 & 0 & -1 & 0 \\ \end{bmatrix} \times \begin{bmatrix} x_\mathrm{eye} \\ y_\mathrm{eye} \\ z_\mathrm{eye} \\ 1 \end{bmatrix} = \begin{bmatrix} x_\mathrm{clip} \\ y_\mathrm{clip} \\ z_\mathrm{clip} \\ -z_\mathrm{eye} \end{bmatrix} $$

The perspective matrix is complex enough that you would probably prefer to hide away its construction in a library routine. That's fine for now.

Try adding `Matrix4.fovPerspective` to your library of code. Have it accept these parameters:

- the vertical field of view in degrees
- the aspect ratio of the frustum
- the near clipping distance
- the far clipping distance

Compute the top and right values as described above and then build your matrix.
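One possible sketch of the builder, written as a standalone function that returns a row-major 4×4 array; in your own library you would wrap this in a static `Matrix4.fovPerspective` method and store the entries however your `Matrix4` class expects:

```javascript
// Build a perspective projection matrix from the four frustum
// parameters. fovY is in degrees. Returns a row-major 4x4 array.
function fovPerspective(fovY, aspect, near, far) {
  const top = Math.tan(fovY / 2 * Math.PI / 180) * near;
  const right = aspect * top;
  return [
    [near / right, 0, 0, 0],
    [0, near / top, 0, 0],
    [0, 0, (near + far) / (near - far), 2 * near * far / (near - far)],
    [0, 0, -1, 0],
  ];
}
```

Note that WebGL expects matrices uploaded via `uniformMatrix4fv` in column-major order, so transpose accordingly when flattening.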