238 lines
9.7 KiB
Markdown
238 lines
9.7 KiB
Markdown
|
# Type inference engine
|
||
|
|
||
|
This is loosely based on standard HM-type inference, but with an
|
||
|
extension to try and accommodate subtyping. There is nothing
|
||
|
principled about this extension; it's sound---I hope!---but it's a
|
||
|
heuristic, ultimately, and does not guarantee that it finds a valid
|
||
|
typing even if one exists (in fact, there are known scenarios where it
|
||
|
fails, some of which may eventually become problematic).
|
||
|
|
||
|
## Key idea
|
||
|
|
||
|
The main change is that each type variable T is associated with a
|
||
|
lower-bound L and an upper-bound U. L and U begin as bottom and top,
|
||
|
respectively, but gradually narrow in response to new constraints
|
||
|
being introduced. When a variable is finally resolved to a concrete
|
||
|
type, it can (theoretically) select any type that is a supertype of L
|
||
|
and a subtype of U.
|
||
|
|
||
|
There are several critical invariants which we maintain:
|
||
|
|
||
|
- the upper-bound of a variable only becomes lower and the lower-bound
|
||
|
only becomes higher over time;
|
||
|
- the lower-bound L is always a subtype of the upper bound U;
|
||
|
- the lower-bound L and upper-bound U never refer to other type variables,
|
||
|
but only to types (though those types may contain type variables).
|
||
|
|
||
|
> An aside: if the terms upper- and lower-bound confuse you, think of
|
||
|
> "supertype" and "subtype". The upper-bound is a "supertype"
|
||
|
> (super=upper in Latin, or something like that anyway) and the lower-bound
|
||
|
> is a "subtype" (sub=lower in Latin). I find it helps to visualize
|
||
|
> a simple class hierarchy, like Java minus interfaces and
|
||
|
> primitive types. The class Object is at the root (top) and other
|
||
|
> types lie in between. The bottom type is then the Null type.
|
||
|
> So the tree looks like:
|
||
|
>
|
||
|
> ```text
|
||
|
> Object
|
||
|
> / \
|
||
|
> String Other
|
||
|
> \ /
|
||
|
> (null)
|
||
|
> ```
|
||
|
>
|
||
|
> So the upper bound type is the "supertype" and the lower bound is the
|
||
|
> "subtype" (also, super and sub mean upper and lower in Latin, or something
|
||
|
> like that anyway).
|
||
|
|
||
|
## Satisfying constraints
|
||
|
|
||
|
At a primitive level, there is only one form of constraint that the
|
||
|
inference understands: a subtype relation. So the outside world can
|
||
|
say "make type A a subtype of type B". If there are variables
|
||
|
involved, the inferencer will adjust their upper- and lower-bounds as
|
||
|
needed to ensure that this relation is satisfied. (We also allow "make
|
||
|
type A equal to type B", but this is translated into "A <: B" and "B
|
||
|
<: A")
|
||
|
|
||
|
As stated above, we always maintain the invariant that type bounds
|
||
|
never refer to other variables. This keeps the inference relatively
|
||
|
simple, avoiding the scenario of having a kind of graph where we have
|
||
|
to pump constraints along and reach a fixed point, but it does impose
|
||
|
some heuristics in the case where the user is relating two type
|
||
|
variables A <: B.
|
||
|
|
||
|
Combining two variables such that variable A will forever be a subtype
|
||
|
of variable B is the trickiest part of the algorithm because there is
|
||
|
often no right choice---that is, the right choice will depend on
|
||
|
future constraints which we do not yet know. The problem comes about
|
||
|
because both A and B have bounds that can be adjusted in the future.
|
||
|
Let's look at some of the cases that can come up.
|
||
|
|
||
|
Imagine, to start, the best case, where both A and B have an upper and
|
||
|
lower bound (that is, the bounds are not top nor bot respectively). In
|
||
|
that case, if we're lucky, A.ub <: B.lb, and so we know that whatever
|
||
|
A and B should become, they will forever have the desired subtyping
|
||
|
relation. We can just leave things as they are.
|
||
|
|
||
|
### Option 1: Unify
|
||
|
|
||
|
However, suppose that A.ub is *not* a subtype of B.lb. In
|
||
|
that case, we must make a decision. One option is to unify A
|
||
|
and B so that they are one variable whose bounds are:
|
||
|
|
||
|
UB = GLB(A.ub, B.ub)
|
||
|
LB = LUB(A.lb, B.lb)
|
||
|
|
||
|
(Note that we will have to verify that LB <: UB; if it does not, the
|
||
|
types are not intersecting and there is an error) In that case, A <: B
|
||
|
holds trivially because A==B. However, we have now lost some
|
||
|
flexibility, because perhaps the user intended for A and B to end up
|
||
|
as different types and not the same type.
|
||
|
|
||
|
Pictorally, what this does is to take two distinct variables with
|
||
|
(hopefully not completely) distinct type ranges and produce one with
|
||
|
the intersection.
|
||
|
|
||
|
```text
|
||
|
B.ub B.ub
|
||
|
/\ /
|
||
|
A.ub / \ A.ub /
|
||
|
/ \ / \ \ /
|
||
|
/ X \ UB
|
||
|
/ / \ \ / \
|
||
|
/ / / \ / /
|
||
|
\ \ / / \ /
|
||
|
\ X / LB
|
||
|
\ / \ / / \
|
||
|
\ / \ / / \
|
||
|
A.lb B.lb A.lb B.lb
|
||
|
```
|
||
|
|
||
|
|
||
|
### Option 2: Relate UB/LB
|
||
|
|
||
|
Another option is to keep A and B as distinct variables but set their
|
||
|
bounds in such a way that, whatever happens, we know that A <: B will hold.
|
||
|
This can be achieved by ensuring that A.ub <: B.lb. In practice there
|
||
|
are two ways to do that, depicted pictorially here:
|
||
|
|
||
|
```text
|
||
|
Before Option #1 Option #2
|
||
|
|
||
|
B.ub B.ub B.ub
|
||
|
/\ / \ / \
|
||
|
A.ub / \ A.ub /(B')\ A.ub /(B')\
|
||
|
/ \ / \ \ / / \ / /
|
||
|
/ X \ __UB____/ UB /
|
||
|
/ / \ \ / | | /
|
||
|
/ / / \ / | | /
|
||
|
\ \ / / /(A')| | /
|
||
|
\ X / / LB ______LB/
|
||
|
\ / \ / / / \ / (A')/ \
|
||
|
\ / \ / \ / \ \ / \
|
||
|
A.lb B.lb A.lb B.lb A.lb B.lb
|
||
|
```
|
||
|
|
||
|
In these diagrams, UB and LB are defined as before. As you can see,
|
||
|
the new ranges `A'` and `B'` are quite different from the range that
|
||
|
would be produced by unifying the variables.
|
||
|
|
||
|
### What we do now
|
||
|
|
||
|
Our current technique is to *try* (transactionally) to relate the
|
||
|
existing bounds of A and B, if there are any (i.e., if `UB(A) != top
|
||
|
&& LB(B) != bot`). If that succeeds, we're done. If it fails, then
|
||
|
we merge A and B into same variable.
|
||
|
|
||
|
This is not clearly the correct course. For example, if `UB(A) !=
|
||
|
top` but `LB(B) == bot`, we could conceivably set `LB(B)` to `UB(A)`
|
||
|
and leave the variables unmerged. This is sometimes the better
|
||
|
course, it depends on the program.
|
||
|
|
||
|
The main case which fails today that I would like to support is:
|
||
|
|
||
|
```text
|
||
|
fn foo<T>(x: T, y: T) { ... }
|
||
|
|
||
|
fn bar() {
|
||
|
let x: @mut int = @mut 3;
|
||
|
let y: @int = @3;
|
||
|
foo(x, y);
|
||
|
}
|
||
|
```
|
||
|
|
||
|
In principle, the inferencer ought to find that the parameter `T` to
|
||
|
`foo(x, y)` is `@const int`. Today, however, it does not; this is
|
||
|
because the type variable `T` is merged with the type variable for
|
||
|
`X`, and thus inherits its UB/LB of `@mut int`. This leaves no
|
||
|
flexibility for `T` to later adjust to accommodate `@int`.
|
||
|
|
||
|
### What to do when not all bounds are present
|
||
|
|
||
|
In the prior discussion we assumed that A.ub was not top and B.lb was
|
||
|
not bot. Unfortunately this is rarely the case. Often type variables
|
||
|
have "lopsided" bounds. For example, if a variable in the program has
|
||
|
been initialized but has not been used, then its corresponding type
|
||
|
variable will have a lower bound but no upper bound. When that
|
||
|
variable is then used, we would like to know its upper bound---but we
|
||
|
don't have one! In this case we'll do different things depending on
|
||
|
how the variable is being used.
|
||
|
|
||
|
## Transactional support
|
||
|
|
||
|
Whenever we adjust merge variables or adjust their bounds, we always
|
||
|
keep a record of the old value. This allows the changes to be undone.
|
||
|
|
||
|
## Regions
|
||
|
|
||
|
I've only talked about type variables here, but region variables
|
||
|
follow the same principle. They have upper- and lower-bounds. A
|
||
|
region A is a subregion of a region B if A being valid implies that B
|
||
|
is valid. This basically corresponds to the block nesting structure:
|
||
|
the regions for outer block scopes are superregions of those for inner
|
||
|
block scopes.
|
||
|
|
||
|
## Integral and floating-point type variables
|
||
|
|
||
|
There is a third variety of type variable that we use only for
|
||
|
inferring the types of unsuffixed integer literals. Integral type
|
||
|
variables differ from general-purpose type variables in that there's
|
||
|
no subtyping relationship among the various integral types, so instead
|
||
|
of associating each variable with an upper and lower bound, we just
|
||
|
use simple unification. Each integer variable is associated with at
|
||
|
most one integer type. Floating point types are handled similarly to
|
||
|
integral types.
|
||
|
|
||
|
## GLB/LUB
|
||
|
|
||
|
Computing the greatest-lower-bound and least-upper-bound of two
|
||
|
types/regions is generally straightforward except when type variables
|
||
|
are involved. In that case, we follow a similar "try to use the bounds
|
||
|
when possible but otherwise merge the variables" strategy. In other
|
||
|
words, `GLB(A, B)` where `A` and `B` are variables will often result
|
||
|
in `A` and `B` being merged and the result being `A`.
|
||
|
|
||
|
## Type coercion
|
||
|
|
||
|
We have a notion of assignability which differs somewhat from
|
||
|
subtyping; in particular it may cause region borrowing to occur. See
|
||
|
the big comment later in this file on Type Coercion for specifics.
|
||
|
|
||
|
### In conclusion
|
||
|
|
||
|
I showed you three ways to relate `A` and `B`. There are also more,
|
||
|
of course, though I'm not sure if there are any more sensible options.
|
||
|
The main point is that there are various options, each of which
|
||
|
produce a distinct range of types for `A` and `B`. Depending on what
|
||
|
the correct values for A and B are, one of these options will be the
|
||
|
right choice: but of course we don't know the right values for A and B
|
||
|
yet, that's what we're trying to find! In our code, we opt to unify
|
||
|
(Option #1).
|
||
|
|
||
|
# Implementation details
|
||
|
|
||
|
We make use of a trait-like implementation strategy to consolidate
|
||
|
duplicated code between subtypes, GLB, and LUB computations. See the
|
||
|
section on "Type Combining" below for details.
|