Great question!
Below is a quick explanation: it glosses over many details, but gives a general overview.
The eye tracker is made up of interconnected modules. Each does its own processing on a set of inputs, and exports its data as a set of outputs that may be used by later modules. Modules are chained into what we call a "pipeline."
Some examples of modules are a pupil detector, a glasses/occlusion detector, and a glint detector. The outputs of these modules are passed to later modules that need to know the pupil location, whether some point is obscured by glasses, or the location of a specific glint. There are dozens of modules, so we won’t go into all of them here.
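To make the module/pipeline idea more concrete, here is a minimal sketch in Python. The names (Module, Pipeline, "pupil_detector", and so on) are illustrative assumptions, not the eye tracker's real API; the point is just that each module declares its inputs and outputs and later modules consume earlier results.

```python
# Minimal sketch of chained modules sharing data through a pipeline.
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Module:
    """One processing step: reads named inputs, writes named outputs."""
    name: str
    inputs: List[str]
    outputs: List[str]
    process: Callable[[Dict[str, Any]], Dict[str, Any]]


@dataclass
class Pipeline:
    """Runs modules in order, passing a shared data dictionary along."""
    modules: List[Module] = field(default_factory=list)

    def run(self, frame_data: Dict[str, Any]) -> Dict[str, Any]:
        for module in self.modules:
            # Each module only sees the inputs it declared...
            args = {key: frame_data[key] for key in module.inputs}
            # ...and contributes its outputs for later modules to use.
            frame_data.update(module.process(args))
        return frame_data


# Example: a pupil detector followed by a glint detector that can
# use the pupil location to narrow its search region.
pupil_detector = Module(
    name="pupil_detector",
    inputs=["eye_image"],
    outputs=["pupil_center"],
    process=lambda d: {"pupil_center": (120, 80)},  # stub result
)
glint_detector = Module(
    name="glint_detector",
    inputs=["eye_image", "pupil_center"],
    outputs=["glint_positions"],
    process=lambda d: {"glint_positions": []},  # stub result
)

pipeline = Pipeline([pupil_detector, glint_detector])
result = pipeline.run({"eye_image": None})
```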
However, in general, modules can be sorted into two categories: Image Processing and 3D Reconstruction.
The image processing portion attempts to localize and track features in the video (such as the pupil, glints, iris, eyelid, etc.). This starts with a basic shape detector to identify features in the 2D image, such as the pupil ellipse. Various refiners then improve that initial estimate. Temporal algorithms are also used, whereby information about the current frame can be gleaned from the previous frame.
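As a rough illustration of the "basic shape detector plus temporal refinement" step, here is a sketch using OpenCV. The thresholds and the way the previous frame is used are assumptions for demonstration, not the tracker's actual algorithm.

```python
# Sketch: find the pupil ellipse as the darkest blob, preferring candidates
# near the previous frame's pupil center (simple temporal prior). OpenCV 4.
import cv2
import numpy as np


def detect_pupil(gray_eye_image, prev_center=None, dark_threshold=40):
    """Fit an ellipse to a dark blob; prefer blobs near last frame's pupil."""
    blurred = cv2.GaussianBlur(gray_eye_image, (7, 7), 0)
    # Under IR illumination the pupil is usually the darkest region.
    _, mask = cv2.threshold(blurred, dark_threshold, 255, cv2.THRESH_BINARY_INV)
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_NONE)

    best, best_score = None, float("inf")
    for contour in contours:
        if len(contour) < 5:  # fitEllipse needs at least 5 points
            continue
        ellipse = cv2.fitEllipse(contour)
        center = np.array(ellipse[0])
        # Temporal prior: favour candidates close to the previous pupil center.
        score = 0.0 if prev_center is None else np.linalg.norm(center - prev_center)
        if score < best_score:
            best, best_score = ellipse, score
    return best  # ((cx, cy), (major, minor), angle) or None
```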
All the image processing features are then used to reconstruct a virtual 3D model of the eye based on the optical block geometry, camera parameters, and headset lens properties. By using the corneal reflections and pupil center features, we can estimate the cornea curvature radius and position, estimate the position of the eye pivot point (eyeball position), and obtain the distance between the pivot point and the pupil (eyeball radius). In addition, all the detected features on the camera image can be projected into 3D space, where every feature (such as the pupil and glint positions) has coordinates in 3D space. Thus every feature (pupil radius, etc.) can be estimated in physical units as opposed to pixels on the 2D image.
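A very simplified sketch of lifting a 2D image feature into 3D follows, assuming a pinhole camera with known intrinsics K and a spherical cornea model. The real system also accounts for headset lens refraction and the optical block geometry, which are omitted here; the numbers are illustrative only.

```python
# Sketch: back-project a pixel to a camera-space ray and intersect it with a
# sphere (e.g. the cornea model) to get a 3D point in physical units (meters).
import numpy as np


def pixel_to_ray(pixel_xy, K):
    """Back-project an image point into a unit direction in camera space."""
    u, v = pixel_xy
    direction = np.linalg.inv(K) @ np.array([u, v, 1.0])
    return direction / np.linalg.norm(direction)


def ray_sphere_intersection(origin, direction, center, radius):
    """First intersection of a unit-direction ray with a sphere, or None."""
    oc = origin - center
    b = 2.0 * direction.dot(oc)
    c = oc.dot(oc) - radius ** 2
    disc = b * b - 4.0 * c
    if disc < 0:
        return None
    t = (-b - np.sqrt(disc)) / 2.0
    return origin + t * direction if t > 0 else None


# Illustrative values: a VGA eye camera and a cornea ~8 mm in radius
# sitting ~35 mm in front of it.
K = np.array([[600.0, 0.0, 320.0],
              [0.0, 600.0, 240.0],
              [0.0, 0.0, 1.0]])
ray = pixel_to_ray((330.0, 250.0), K)
point_on_cornea = ray_sphere_intersection(
    np.zeros(3), ray, center=np.array([0.0, 0.0, 0.035]), radius=0.008)
```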
Once an accurate 3D model of the eye is constructed, we can estimate the gaze direction by directing a ray from the eyeball pivot through the pupil, then applying the optical-to-visual axis offset.
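In code, that final step might look roughly like the sketch below: the optical axis is the ray from the eyeball pivot through the 3D pupil center, and a small per-user angular offset (often called the kappa angle) rotates it onto the visual axis. The rotation convention and the sample offsets are assumptions for illustration.

```python
# Sketch: gaze = optical axis (pivot -> pupil) rotated by a calibration offset.
import numpy as np


def optical_axis(eyeball_center, pupil_center_3d):
    """Unit vector from the eyeball pivot through the 3D pupil center."""
    v = np.asarray(pupil_center_3d) - np.asarray(eyeball_center)
    return v / np.linalg.norm(v)


def apply_kappa(optical_dir, kappa_horizontal_deg, kappa_vertical_deg):
    """Rotate the optical axis by per-user offsets to get the visual axis."""
    h = np.radians(kappa_horizontal_deg)
    v = np.radians(kappa_vertical_deg)
    rot_y = np.array([[np.cos(h), 0, np.sin(h)],     # yaw about vertical axis
                      [0, 1, 0],
                      [-np.sin(h), 0, np.cos(h)]])
    rot_x = np.array([[1, 0, 0],                     # pitch about horizontal axis
                      [0, np.cos(v), -np.sin(v)],
                      [0, np.sin(v), np.cos(v)]])
    return rot_x @ rot_y @ optical_dir


gaze = apply_kappa(
    optical_axis(eyeball_center=[0.0, 0.0, 0.0], pupil_center_3d=[0.0, 0.0, 0.012]),
    kappa_horizontal_deg=5.0, kappa_vertical_deg=1.5)
```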
Of course, the eye tracker is under constant development, so details change from version to version, but the general approach is constant.
The depth estimation feature measures the dynamic IPD (interpupillary distance) and compares it against the fixed IOD (interocular distance) of the user. When they are equal, the user is looking far away. The smaller the IPD, the more the eyes are converged, and thus, the closer the user is looking.
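Here is a toy geometric sketch of that IPD-vs-IOD relationship, assuming both eyes converge symmetrically on a point straight ahead. The eyeball radius and the exact formula are simplifying assumptions, not the shipped algorithm.

```python
# Sketch: estimate fixation distance from the measured pupil separation.
import math


def fixation_distance(ipd_mm, iod_mm, eyeball_radius_mm=12.0):
    """Estimate fixation distance (mm) from dynamic IPD vs. fixed IOD."""
    # Each pupil sits on a sphere of radius r around its pivot. When an eye
    # rotates inward by angle a, the pupil shifts inward by roughly r*sin(a):
    #   IPD = IOD - 2 * r * sin(a)
    sin_a = (iod_mm - ipd_mm) / (2.0 * eyeball_radius_mm)
    if sin_a <= 0.0:
        return math.inf          # eyes parallel: looking far away
    a = math.asin(min(sin_a, 1.0))
    # Both visual axes meet on the midline at distance (IOD/2) / tan(a).
    return (iod_mm / 2.0) / math.tan(a)


print(fixation_distance(ipd_mm=63.0, iod_mm=64.0))   # roughly 770 mm
print(fixation_distance(ipd_mm=64.0, iod_mm=64.0))   # inf -> looking far away
```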
Since the screen is viewed at a fixed focal distance through the HMD’s lenses, the eyes’ own lenses do not refocus while in VR, even when looking at close or far objects in the virtual space. The 3D effect is achieved by showing each eye slightly different images, so the depth the user is looking at can only be measured through vergence.