
Introduction

1 The Latent Budget Model

The latent budget model (LBM) is a mixture model for compositional data. It gives insight into a compositional data set without having to deal with the covariance matrix, which is problematic for compositional data because of the sum constraint. By performing latent budget analysis (LBA) we approximate I observed budgets, which may represent persons, groups or objects, by a small number of latent budgets that consist of typical characteristics of the sample. Such an approximation can be used, for example, for classification.

The idea of the LBM was proposed by Goodman (1974a), and elaborated by Clogg (1981) by interpreting a simple latent class model in an asymmetric way. Independently, de Leeuw & van der Heijden (1988) introduced the model and named it ``latent budget analysis'' because they used it to analyze time-budget data. The model was also introduced independently in geology by Renner (1988; 1993), where it is known as the endmember model.

Consider an I × J compositional data matrix P, consisting of I observed budgets p_i, with components p_j|i. In the LBM the observed budgets p_i are approximated by expected budgets p_i, which are mixtures of K (K ≤ min(I,J)) typical compositions or latent budgets. The latent budgets are denoted by b_k (k = 1, ..., K), and the model can be written as

p_i = a_1|i b_1 + ... + a_k|i b_k + ... + a_K|i b_K    (i = 1, ..., I)    (1)

where a_k|i (i = 1, ..., I; k = 1, ..., K) are the mixing parameters. The elements of p_i are p_j|i (j = 1, ..., J) and are called expected components. The elements of b_k are b_j|k (j = 1, ..., J) and are called latent components. Alternative notations for (1) are the scalar notation

p_j|i = Σ_{k=1}^{K} a_k|i b_j|k    (i = 1, ..., I; j = 1, ..., J)    (2)

and matrix notation

P = AB^T    (3)

In (3) P is an I × J matrix whose rows are the expected budgets, A is an I × K matrix of mixing parameters, and B is a J × K matrix whose columns are the latent budgets. The superscript ``T'' denotes the transpose of a matrix. The latent budget model with K latent budgets is denoted as LBM(K). Like the observed components, the parameters of the LBM are subject to the sum constraints

Σ_{j=1}^{J} p_j|i = Σ_{k=1}^{K} a_k|i = Σ_{j=1}^{J} b_j|k = 1    (4)

and the nonnegativity constraints

0 ≤ p_j|i ≤ 1,  0 ≤ a_k|i ≤ 1,  0 ≤ b_j|k ≤ 1.    (5)
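Equations (1) to (5) can be checked numerically. The following is a minimal sketch in Python with NumPy. The matrices A and B hold made-up illustrative values (I = 3, J = 4, K = 2), not estimates from any data set, and the function names are hypothetical.

```python
import numpy as np

# Hypothetical mixing parameters a_k|i: I = 3 rows, K = 2 latent budgets.
A = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Hypothetical latent components b_j|k: J = 4 rows, K = 2 columns.
# Each column is one latent budget.
B = np.array([
    [0.6, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.0, 0.5],
])


def expected_budgets(A, B):
    """Return the I x J matrix of expected budgets, P = A B^T (equation 3)."""
    return A @ B.T


def satisfies_constraints(M, axis):
    """Check (4) and (5): all entries in [0, 1] and sums of 1 along `axis`."""
    in_range = np.all((M >= 0.0) & (M <= 1.0))
    sums_to_one = np.allclose(M.sum(axis=axis), 1.0)
    return bool(in_range and sums_to_one)


P = expected_budgets(A, B)
print(P)
print(satisfies_constraints(A, axis=1))  # rows of A
print(satisfies_constraints(B, axis=0))  # columns of B
print(satisfies_constraints(P, axis=1))  # rows of P
```

Because the rows of A and the columns of B are proportions, each row of P is a convex combination of the latent budgets, so the expected budgets satisfy (4) and (5) whenever A and B do.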

Thus, all parameters are proportions and this facilitates the interpretation of the model. In fact, it has been argued frequently that its ease of interpretation is one of the main reasons to use LBA (for example de Leeuw & van der Heijden, 1988; de Leeuw et al., 1990; van der Ark & van der Heijden, 1998; van den Brakel, 1996).

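If the data have a product-multinomial distribution, we can compute the unconditional expected probabilities p_ij from the expected components. The following properties hold for the expected components and the corresponding unconditional probabilities (see de Leeuw et al., 1990):

p_ij = p_i+ p_j|i,

expected p_i+ = observed p_i+,

expected p_+j = observed p_+j,

where p_i+ and p_+j denote the row and column marginal proportions. In other words, the expected row and column margins equal the observed margins.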

Van der Heijden, Mooijaart & de Leeuw (1992) proposed two ways to interpret the model (see also van der Ark & van der Heijden, 1998), which we will call the mixture model interpretation and the MIMIC-model interpretation (Multiple Indicator Multiple Cause-model, Goodman, 1974a; see also Clogg, 1981).

Thus far we have treated the LBM as a mixture model: the LBM writes the expected budgets as a mixture of a small number of typical, or latent, budgets. Hence, each expected budget is built up out of the K latent budgets, and the mixing parameters determine to what extent. The latent budgets can be characterized by comparing them with the latent budget of LBM(1). LBM(1) is the independence model, with a_1|i = 1 (i = 1, ..., I) and b_j|1 = p_+j (j = 1, ..., J); in this case p_i = b_1 for i = 1, ..., I. Hence, if latent component b_j|k is greater than the corresponding component of the independence model, p_+j, then b_k is characterized by the j-th category. On the other hand, if b_j|k is less than p_+j, then the j-th category is of lesser importance for b_k.

The relative importance of each latent budget, in terms of how much of the expected data it accounts for, is expressed by the budget proportions p_k = Σ_i p_i+ a_k|i. The budget proportion p_k (k = 1, ..., K) also denotes the probability of latent budget k when there is no information about the level of the row variable. To understand how the expected budgets are constructed from the latent budgets we must compare the mixing parameters to p_k. If a_k|i > p_k then expected budget p_i is characterized more than average by latent budget b_k, and if a_k|i < p_k then expected budget p_i is characterized less than average by latent budget b_k. In practice the mixture model interpretation is carried out most easily when we first characterize the latent budgets and then interpret the expected budgets in terms of the latent budgets.
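The budget proportions and the comparison with the mixing parameters can be sketched as follows. All numbers are made-up illustrative values (I = 3, J = 4, K = 2), and the variable names are hypothetical. The sketch also checks an identity that follows directly from equation (2): the expected column margins are the mixture of the latent budgets weighted by the budget proportions.

```python
import numpy as np

# Hypothetical row proportions p_i+ (I = 3); they sum to one.
row_props = np.array([0.5, 0.3, 0.2])

# Hypothetical mixing parameters a_k|i (I x K) and latent components b_j|k (J x K).
A = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])
B = np.array([
    [0.6, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.0, 0.5],
])

# Budget proportions: p_k = sum_i p_i+ a_k|i.
budget_props = row_props @ A

# Expected column margins: sum_i p_i+ p_j|i, with p_j|i taken from P = A B^T.
expected_col_margins = row_props @ (A @ B.T)

# The same margins written as a mixture of latent budgets: sum_k p_k b_j|k.
mixture_of_latent = B @ budget_props

# above_average[i, k] is True when a_k|i > p_k, i.e. expected budget i is
# characterized more than average by latent budget k.
above_average = A > budget_props

print(budget_props)
print(np.allclose(expected_col_margins, mixture_of_latent))
print(above_average)
```

In this toy example the first row is characterized more than average by the first latent budget, and the other two rows by the second.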

Table 1: Voting behavior by city type in the 1986 elections in the Netherlands, in frequencies (upper half) and in components (lower half). The political parties are PvdA (labor party), CDA (Christian democrats), VVD (liberals), D66 (democrats), left (other left-wing parties), right (other right-wing parties).

                        PvdA    CDA    VVD    D66   left  right   Total
Frequencies
Rural                    285    482    186     49     21     60    1083
Rural industrialized     620    914    308    102     42     97    2083
Commuter                 355    460    347    104     36     47    1349
Small city               336    337    168     62     27     46     976
Middle large city        548    455    233     91     47     43    1417
Large city               903    516    343    153    110     37    2062
Components
Rural                   .263   .445   .172   .045   .019   .055    1.00
Rural industrialized    .298   .439   .148   .049   .020   .047    1.00
Commuter                .263   .341   .257   .077   .027   .035    1.00
Small city              .344   .345   .172   .064   .028   .047    1.00
Middle large city       .387   .321   .164   .064   .033   .030    1.00
Large city              .438   .250   .166   .074   .053   .018    1.00
Source: Statistics Netherlands (1987)
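The lower half of Table 1 can be recomputed from the upper half. The sketch below, in Python with NumPy, copies the frequencies from Table 1 and derives the observed components, the row proportions p_i+ and the independence budget p_+j. The variable names are illustrative choices.

```python
import numpy as np

parties = ["PvdA", "CDA", "VVD", "D66", "left", "right"]
city_types = [
    "Rural", "Rural industrialized", "Commuter",
    "Small city", "Middle large city", "Large city",
]

# Frequencies copied from the upper half of Table 1 (rows: city types).
freq = np.array([
    [285, 482, 186,  49,  21, 60],
    [620, 914, 308, 102,  42, 97],
    [355, 460, 347, 104,  36, 47],
    [336, 337, 168,  62,  27, 46],
    [548, 455, 233,  91,  47, 43],
    [903, 516, 343, 153, 110, 37],
])

row_totals = freq.sum(axis=1)
n_total = freq.sum()

# Observed components p_j|i: each row divided by its own total.
components = freq / row_totals[:, None]

# Row proportions p_i+ and the independence budget p_+j.
row_props = row_totals / n_total
independence_budget = freq.sum(axis=0) / n_total

# Unconditional proportions recovered from the components: p_ij = p_i+ p_j|i.
joint = row_props[:, None] * components

for name, budget in zip(city_types, components):
    print(f"{name:22s}", np.round(budget, 3))
print(f"{'independence budget':22s}", np.round(independence_budget, 3))
```

Each observed budget sums to one by construction, and multiplying the components by the row proportions gives back the original table divided by the grand total.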

For interpreting the LBM as a MIMIC-model, we view the observed components as conditional proportions of the row variable X (for example ``city type'' in Table 1), with I categories, and the column variable Y (for example ``political party'' in Table 1), with J categories. If we assume that the row variable and the column variable are independent given some latent variable Z with K categories, then the LBM describes the relationship between the row variable and the column variable in an asymmetric way, i.e. p_j|i = P(Y = j|X = i) denotes the probability to respond to category j of Y, given that one belongs to the i-th category of X; these probabilities are explained by a_k|i = P(Z = k|X = i), which is the probability that row category i belongs to latent category k, and b_j|k = P(Y = j|Z = k), which is the probability that a member of latent category k responds to the j-th category of Y.
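The conditional independence assumption can be illustrated numerically. The sketch below builds a joint distribution of X, Z and Y in which X and Y are independent given Z, and then checks that P(Y = j|X = i) has the LBM form of equation (2). All probabilities are made-up illustrative values and the variable names are hypothetical.

```python
import numpy as np

# Hypothetical P(X = i) for I = 3 row categories.
p_x = np.array([0.5, 0.3, 0.2])

# Hypothetical a_k|i = P(Z = k | X = i), an I x K matrix with K = 2.
A = np.array([
    [0.9, 0.1],
    [0.5, 0.5],
    [0.2, 0.8],
])

# Hypothetical b_j|k = P(Y = j | Z = k), a J x K matrix with J = 4.
B = np.array([
    [0.6, 0.1],
    [0.3, 0.1],
    [0.1, 0.3],
    [0.0, 0.5],
])

# joint[i, k, j] = P(X = i) P(Z = k | X = i) P(Y = j | Z = k).
# Y depends on X only through Z, which is the conditional independence
# assumption of the MIMIC-model interpretation.
joint = p_x[:, None, None] * A[:, :, None] * B.T[None, :, :]

# Sum out the latent variable Z, then condition on X.
p_xy = joint.sum(axis=1)
p_y_given_x = p_xy / p_xy.sum(axis=1, keepdims=True)

# The result coincides with the expected budgets P = A B^T.
print(np.allclose(p_y_given_x, A @ B.T))
```

The latent variable never has to be observed: only the I × J table of conditional proportions enters the comparison.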

If the compositional data do not have a product-multinomial distribution then the MIMIC-model interpretation may be troublesome: for example, if each observed budget represents a multivariate observation on a single subject, then it is unclear what P(Y = j|Z = k) means. If the rows of the compositional data are not independent, for example if they denote groups and people may belong to more than one group, then P(Z = k|X = i) is not well defined.

A graphic representation of a mixture model and a MIMIC-model is given in the figure below. In the left panel of the figure the squares represent the expected budgets p_i and the circles the latent budgets b_k. The arrows represent the mixing parameters a_k|i. In the right panel the squares on the left-hand side represent the row categories and the squares on the right-hand side represent the column categories. The arrows on the left-hand side represent the mixing parameters a_k|i and the arrows on the right-hand side represent the latent components b_j|k.

Figure: Graphic display of a mixture model (left panel) and a MIMIC-model (right panel).

As an example of LBA we analyzed the data in Table 1 with LBM(3). The parameter estimates are given in Table 2 and have been identified¹.

Table 2: Mixing parameters and latent components of the LBM(3) solution of the election data in Table 1. The budget proportions (p_k) and the independence budget (p_+1, ..., p_+j, ..., p_+J) are also given.

Mixing parameters
Row category       k = 1   k = 2   k = 3   indep
Rural               0.12    0.65    0.23    1.00
Rural Ind.          0.18    0.61    0.20    1.00
Commuter            0.20    0.44    0.36    1.00
Small city          0.30    0.46    0.24    1.00
Medium city         0.39    0.38    0.22    1.00
Large city          0.54    0.32    0.23    1.00
p_k                 0.31    0.45    0.24    1.00

Latent components
Column category    k = 1   k = 2   k = 3   indep
PvdA                0.70    0.28    0.00    0.34
CDA                 0.16    0.63    0.08    0.35
VVD                 0.00    0.01    0.71    0.18
D66                 0.06    0.00    0.18    0.06
Left-wing           0.08    0.00    0.03    0.03
Right-wing          0.00    0.08    0.00    0.03

The mixture model interpretation of Table 2 is as follows. First, we interpret the latent budgets by comparing them to the independence budget. The first latent budget has greater proportions in the components ``PvdA'' (labor) and the ``small left-wing'' parties than the independence budget, and can be described as a ``socialist budget''. The second latent budget has greater proportions in the components ``CDA'' (Christian Democrats) and ``small right-wing'' parties than the independence budget. Since the right-wing parties are conservative Christian parties, we can describe this budget as a ``Christian/conservative budget''. The third latent budget has greater proportions in the components ``VVD'' (right-wing liberals) and ``D66'' (left-wing liberals) and can be described as a ``liberal budget''.

Second, by comparing the mixing parameters to the budget proportions p_k, we see that the rural and rural industrialized areas predominantly have a Christian/conservative voting pattern. Commuters are predominantly liberal. The small cities display the average voting pattern, because their mixing parameters are almost equal to p_k. In the larger cities the socialist budget is most important.
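The two comparisons can be carried out mechanically. The sketch below copies the values from Table 2 as printed, rounded to two decimals, so it is an illustration of the interpretation rule rather than a re-analysis of the data. The variable names are illustrative choices.

```python
import numpy as np

rows = ["Rural", "Rural Ind.", "Commuter", "Small city", "Medium city", "Large city"]
cols = ["PvdA", "CDA", "VVD", "D66", "Left-wing", "Right-wing"]

# Mixing parameters a_k|i and budget proportions p_k, as printed in Table 2.
A = np.array([
    [0.12, 0.65, 0.23],
    [0.18, 0.61, 0.20],
    [0.20, 0.44, 0.36],
    [0.30, 0.46, 0.24],
    [0.39, 0.38, 0.22],
    [0.54, 0.32, 0.23],
])
budget_props = np.array([0.31, 0.45, 0.24])

# Latent components b_j|k and independence budget p_+j, as printed in Table 2.
B = np.array([
    [0.70, 0.28, 0.00],
    [0.16, 0.63, 0.08],
    [0.00, 0.01, 0.71],
    [0.06, 0.00, 0.18],
    [0.08, 0.00, 0.03],
    [0.00, 0.08, 0.00],
])
independence = np.array([0.34, 0.35, 0.18, 0.06, 0.03, 0.03])

# Step 1: latent budget k is characterized by category j when b_j|k > p_+j.
characteristic = {
    k + 1: [cols[j] for j in range(len(cols)) if B[j, k] > independence[j]]
    for k in range(B.shape[1])
}
print(characteristic)

# Step 2: compare each mixing parameter with p_k. A positive difference means
# the row is characterized more than average by that latent budget.
diff = A - budget_props
dominant = diff.argmax(axis=1) + 1  # latent budget with the largest excess
for name, d, k in zip(rows, diff, dominant):
    print(f"{name:12s}", np.round(d, 2), "largest excess: k =", k)
```

For the small cities all three differences are at most 0.01 in absolute value, which is why that row is read as the average voting pattern even though one latent budget is formally the largest.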

Alternatively, from the MIMIC-model interpretation we may conclude that subjects from the rural areas have a higher-than-average probability of being a member of latent stage 2, commuters have a higher-than-average probability of being a member of latent stage 3, and subjects from the bigger cities have a higher-than-average probability of being a member of latent stage 1. Subjects of latent stage 1 predominantly vote left-wing and PvdA (labor), subjects of latent stage 2 vote right-wing and CDA (Christian Democrats), and subjects of latent stage 3 predominantly vote liberal (VVD and D66). Interpretation of the latent stages is often difficult. However, we can describe latent stage 1 as the level dominated by left-wing oriented city people, latent stage 2 as a level dominated by religious rural people, and latent stage 3 as a level dominated by liberal commuters.

We conclude this section with the remark that alternative notations for the parameters are in use, which indicate more or less explicitly that the parameters are conditional proportions. A review is presented in Table 3. Sometimes an alternative notation is more convenient, for example in the discussion of the relationship between LBA and Latent Class Analysis (LCA; see van der Ark & van der Heijden, 1996, Section 5).

Table 3: Alternative notations for the observed components and the latent budget parameters.

                      Notation
                      1       2        3        4
latent component      b_jk    b_j|k    p_j|k    p_jk^{\bar{X}Z}
observed component    p_ij    p_j|i    p_j|i    p_ij^{X\bar{Y}}
expected component    p_ij    p_j|i    p_j|i    p_ij^{X\bar{Y}}
mixing parameter      a_ik    a_k|i    p_k|i    p_ik^{X\bar{Z}}
1 = de Leeuw et al. (1990).
2 = van der Ark et al. (in press), this monograph.
3 = van der Ark & van der Heijden (1998).
4 = van der Heijden et al. (1992); LCA literature.

Footnotes:

¹ The parameter estimates of the LBM should be identified before the latent budget solution can be interpreted. This problem is discussed in Chapter 2. Here we identified the parameter estimates with the outer extreme solution (OES), see van der Ark, van der Heijden & Sikkel (1999).
