Term Lengths encountering Statistics
Laying out GUIs for multilingual software often requires predicting the screen space needed for terms in different languages and alphabets. In this article we examine a given set of 386 key terms used in balance sheet accounting, each available in 11 languages, with methods of descriptive statistics. We will find and discuss patterns and relations that allow us to estimate the required screen space when the length of the English term is given. We derive methods and rules of thumb that practitioners who design and develop user interfaces for multiple languages can apply.
Why in English?
When some of my Czech colleagues recently showed interest in reading some of my articles, I became aware that writing this blog in English rather than in German would make the presented thoughts more accessible, not only to my colleagues in Prague but also to other readers who do not speak German. That’s why I will try to write my articles in English in general and see how this works out. However, articles which deal with websites and applications laid out, written and labelled in German (like the ones my students are developing) will continue to be in German.
Screen space for translated terms
Some months ago I came across a database file which stores the terms used in balance sheet accounting, translated into several languages. When working on screen design for multilingual software and content, I had wondered whether there is a way of estimating the space required by the different translations of a given word, term or sentence, particularly when the translations are not done yet, which is quite common in the production process. The risk of reserving too little space for the translated terms is that a term gets cut off or flows over into other areas. The risk of overestimating the required space is that screen space is wasted and unwanted gaps appear in the layout. Both risks should be avoided. Sure, but how can we tell? Let me put the question this way: are there any rules or even algorithms that tell us how long a term might become when it is translated from English into another language?
When I looked at the translation data I thought I might play with them and see what I could find out by applying some statistical methods. I might even be able to provide some rules of thumb. We should keep in mind that these rules are restricted to the domain of balance sheet accounting across several languages and cannot be generalized to languages in general. However, we will gain insights from which we can derive rules and approaches for practitioners in screen design and in writing interface texts. Let us first have a look at the key data of the database.
Key data of the database
Industry: Financial, Domain: balance sheet accounting
Number of terms (in each language, except Czech): 386
- English (EN)
- German (DE)
- Spanish (SP)
- French (FR)
- Italian (IT)
- Japanese (JP)
- Dutch (NL)
- Portuguese (PT)
- Russian (RU)
- Chinese (ZH)
- Czech (CZ)
We can consider the typographic parameters as normalized, since all terms are formatted in Arial at a text size of 9 pt (in Excel).
Two examples, first hypotheses
Let’s look at one single example, the translations of the long term „Acquisition cost plant and factory equipment“.
Looking at the space occupied by the strings in the different languages, we can immediately make a few observations about this particular example:
- The financial term does not consist of a single word but of a combination of several words.
- This also applies to Russian, a language written in the non-Latin Cyrillic alphabet. For Chinese and Japanese I just couldn’t tell by looking at the data.
- The shortest string has about 1/3 of the size of the longest one.
- Chinese and Japanese need relatively little space.
- Russian needs the most space.
- Some translations seem to incorporate abbreviations (German: „AHK“), others add information in brackets, which makes terms consisting of several words even more complex.
Now let’s examine another example picked out of the database, the short term „Inventories“:
Some of the previous observations are confirmed, others are contradicted or need to be differentiated:
- In the second dataset the Czech translation needs the most space, not the Russian translation.
- Chinese still has the shortest length with two characters, but Japanese is quite wide.
One main observation about the term „Inventories“ is that every translated term (including the English term itself) is much shorter than in the previous example. So we might suppose that there is some correlation between the length of an English term and its counterparts in other languages. The underlying reason could be called evident or even trivial: simple things need only a short description in any language. To put it the other way round: if we have to describe something complex, we will need more words than for describing something simple, in English as well as in any other language. And this relation of shared growth of term lengths might be stronger than the differences between our eleven languages, not necessarily in every case but as a general tendency. We will examine this later on as the correlation between term lengths of different languages.
Skimming through all the translations of the 386 terms seems to confirm most of the observations made above. In order to get a better overview of the data, let us focus first on counting the number of characters (= letters and spaces a term consists of). But before doing this we should state that the number of characters of a term does not necessarily equal the required space. Terms in languages with a Latin alphabet may differ in width depending on the frequency of letters in that particular language: „m“s require more space than „i“s. More importantly, there seems to be a visually large difference between the Latin alphabet and non-Latin ones: we just need to look at our two examples shown previously to notice that Russian, Japanese and Chinese characters seem to require more pixel width than the average Latin character. However, we start with counting the number of characters of each translated term (see the spreadsheet file „Term Lengths and Means per Language“ in Google Docs), keeping in mind that this is not yet the metric we want to have in the very end.
Central Tendency of Term Lengths in different Languages
Let’s start by looking at the arithmetic mean of all term lengths within one particular language and compare those means in absolute numbers.
What we can read about the means from Graph 1:
- European languages range from 24.8 (English) to 32.4 (German) characters for the average term in this dataset of balance sheet accounting terms.
- German has the highest mean of all 11 languages, which seems to confirm the general opinion that German texts „run“ rather long.
- Asian languages have dramatically low numbers: the average term counts 7.1 (Chinese) and 8.0 (Japanese) characters. This is in line with our observations of the examples above as documented in Screenshots 1 + 2.
- Russian, written in the Cyrillic alphabet, runs quite long as well with 31.9 characters for the average term, almost as long as German.
We can look at the same values from the point of view of the English language and compute ratios. To do so, the English average of 24.8 characters is set to the factor 1.0; the other values tell us the ratio of the average term length in the other languages to the English one. For instance, the average term length in German has a ratio of 1.31 (32.4 / 24.8), which says that it exceeds the average English term length by nearly 1/3. On the other hand, the ratio for Japanese is 0.32 (8.0 / 24.8), which tells us that Japanese terms require only about 1/3 of the number of characters needed in English. This already gives us quantitative information.
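The ratio computation can be sketched in a few lines of Python. This is only an illustration: the dictionary holds just the five means quoted in the text above (the full table lives in the spreadsheet), and the function name `ratios_to` is my own choice.

```python
# Mean term lengths in characters, as quoted in the text above.
# Only the five languages whose means are stated in the article are included.
mean_chars = {"EN": 24.8, "DE": 32.4, "RU": 31.9, "JP": 8.0, "ZH": 7.1}


def ratios_to(base: str, means: dict[str, float]) -> dict[str, float]:
    """Express every mean as a multiple of the base language's mean."""
    return {lang: round(value / means[base], 2) for lang, value in means.items()}


ratios_en = ratios_to("EN", mean_chars)  # English is set to factor 1.0
ratios_de = ratios_to("DE", mean_chars)  # the same data seen from German
print(ratios_en)
print(ratios_de)
```

Switching the base language is just a matter of passing another key, which is how the same means can be normalized to German instead of English.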
Taking a look at German
Let us take a closer look at the relation between English and German. From the above figures we already derived that we need about 1/3 more characters for a German term when we translate it from an English term of average length. Some examples from the primary data set confirm this rough guess:
- EN: „Other short-term payables“ (25 chars) > DE: „Sonstige Zahlungsverpflichtungen“ (32 chars)
- EN: „Bank loans and overdrafts“ (25 chars) > DE: „Bankdarlehen und Überziehungskredite“ (36 chars)
If we filter all English terms with 25 characters (which is the rounded mean for English), we notice that a few of the German translations have fewer than 25 characters, but the majority clearly has more than 30 or even 40 characters, as Table 3 shows.
This example lets us derive a very rough rule of thumb:
Terms dealing with balance sheet accounting often require 1/3 or more additional characters when translated from English to German.
Of course, this is not a scientific deduction; for instance, we have not yet looked at all the other English terms with more or fewer than 25 characters. Nonetheless, in a daily layout task applying such a rule of thumb might be better than knowing nothing about the screen space that needs to be reserved. Before we move on to a broader view of the data, let’s switch the perspective to German for the sake of German designers. We already found out that in our dataset the German terms have the highest number of characters on average. If we normalize Graph 1 to the mean of the German terms, we get a clear picture of the relations in Graph 3, and this differs a lot from the English perspective we saw before.
Compared with German, the average term of no other language is longer. The closest is Russian with an average length of 98% of the German one. In pixels, Russian could still exceed the space required by German, as the Cyrillic alphabet seems to have wider characters; we will examine this aspect later on.
This suggests that we do not need to care much about reserving additional space for other languages once we do the layout with German terms, since on average every other language needs fewer characters. So this might be another insight from our data analysis. The Asian languages have such low ratios against German that it seems very unlikely that Japanese and Chinese terms exceed their German counterparts. And in fact, filtering the data set for the German terms with extremely few characters shows that the Chinese terms do not exceed their length, so it looks reasonable to assume that in almost every case the Asian term runs shorter than the corresponding German term.
From a designer’s perspective, a danger might be lurking from an unexpected side that we need to keep an eye on during layout: such low ratios might result in a layout that looks unbalanced in the Asian versions, because only a few characters occupy a space which has been designed for something four times larger.
Distribution of term lengths in different languages
Until now, we have been looking at the term lengths by their central tendency in the different languages. We compared some of the means but became aware that we need to look at all of the term lengths and their frequency within a particular language. Let’s now examine the distribution of all term lengths within one language and then compare the results across languages.
We are graphing histograms [Harris 1] for each language. In each graph we transform the discrete variable of term length (values can only be integers) into an interval variable that aggregates term lengths into bins of 4 characters, from 1 up to 64 characters. This is what the frequency distributions look like for English and German when visualized as histograms:
Comparing the histogram of the German terms with that of the English terms shows:
- Both distributions have a positive skew: the maxima are towards the lower end of the x-axis and we have a tail to the right.
- English terms have no more than 52 characters.
- English terms have a clear frequency maximum around x(EN) = 20.
- German has no clear maximum but rather a plateau ranging from x(DE) = 20 to x(DE) = 32.
- German seems to decrease in frequency f(DE) after passing its high plateau but increases again towards the end at x(DE) = 64. Among all 11 histograms, German is the only language showing this characteristic, and this needs a closer look at the primary data.
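The binning behind these histograms can be sketched as follows. This is a minimal illustration, assuming bins of 4 characters that run 1-4, 5-8, … up to 61-64; the helper names `bin_label` and `histogram` are mine, and the sample list only contains three English terms quoted in this article, not the full data set.

```python
from collections import Counter

BIN_WIDTH = 4   # bin size in characters, as described in the text
MAX_LEN = 64    # upper limit of term length in the data set


def bin_label(length: int) -> str:
    """Return the label of the 4-character bin a term length falls into."""
    if not 1 <= length <= MAX_LEN:
        raise ValueError(f"term length {length} is outside 1..{MAX_LEN}")
    start = (length - 1) // BIN_WIDTH * BIN_WIDTH + 1
    return f"{start}-{start + BIN_WIDTH - 1}"


def histogram(terms: list[str]) -> Counter:
    """Count terms per bin; len() counts letters and spaces alike."""
    return Counter(bin_label(len(term)) for term in terms)


# Three English terms quoted in this article (illustration only).
sample_en = [
    "Other short-term payables",
    "Bank loans and overdrafts",
    "Inventories",
]
print(histogram(sample_en))
```

With the real data, one such counter per language yields the frequencies that are plotted as bars.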
Filtering the primary table for x(DE) = 62, 63, 64 shows immediately that some of the longest German terms are already abbreviated. For technical reasons the translators had already shortened the regular translation to avoid exceeding 64 characters. Without this technical constraint the abbreviated terms would have had even more letters. In the histogram this would result in an even longer right tail with each bin showing fewer counts. So we can explain the unusual pattern in the DE histogram by two combined factors: first, German per se consumes a lot of letters for the given terms; second, the terms jam up at the technical limit which does not allow more than 64 characters.
Filtering the primary data for only the German terms with 63 to 64 characters (Table 4) tells us that abbreviations are applied quite heavily in these results, as in the example „Kum. Abschr. auf and., sachgerecht bez. Gruppen von Anl.-Gegenst“. In addition, we can observe that other translations followed the German abbreviations (e.g. the English abbreviation „acc“ („accumulated“) for the German „kum“ („kumuliert“)) or left out certain parts of the German term (e.g. English „Materials“ for German „Materialaufwand (Roh-, Hilfs-, und Bestriebsst., Waren/Leistungen)“).
Let’s now extend the scope of our examination towards the frequency distribution of term lengths across all 11 languages. I have grouped the languages in the resulting multiple histogram by their well-known relation to the language families Germanic, Romanic, Slavic and Asian.
Looking at the histograms we get some insights into the characteristics:
- Strong confirmation that Chinese and Japanese terms are way shorter than their counterparts in the Germanic, Slavic or Romanic families.
- The Romanic languages SP, FR, IT, PT are similar to each other, which was to be expected from the well-known similarity of Romanic languages in vocabulary and grammar.
- But they don’t look that much different from Germanic Dutch or Slavic Czech, which surprised me a bit.
- Only Slavic Russian seems to have a different pattern, as very long terms seem to occur quite often. This fits our previous calculation that Russian has the second highest mean. But then again: the visual pattern of the Russian histogram does not look completely different from, let’s say, Spanish.
- And the pattern of Slavic Czech looks more similar to any of the Romanic languages than to Russian.
After all, the comparison of histograms does not give us very precise rules with regard to the original aim of reserving space for translated terms. The histograms rather give us a general impression of term lengths. But it is not yet clear whether short English terms result in short German / Russian / French / … terms as well, and what range of term lengths we can expect. If we want to examine this question, we need to change our approach from analysing one variable to looking at the relation between two variables.
Contingency Tables and Scatterplots
In the next step we will look at the English terms as being one variable and any other language being a second variable. On the basis of our term length data we identify which English term lengths (x) result in which length (y) of another language and aggregate the frequencies in a contingency table. A contingency table gives the frequency f(x,y) of different combinations of values of two (or more) variables.
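As a small sketch of how such a contingency table can be built, the following Python snippet counts the (x, y) length combinations with a `Counter`. The two EN/DE pairs are the examples quoted earlier in this article; the variable names are my own and the real table of course covers all terms.

```python
from collections import Counter

# Two (English, German) term pairs quoted earlier in this article.
pairs = [
    ("Other short-term payables", "Sonstige Zahlungsverpflichtungen"),
    ("Bank loans and overdrafts", "Bankdarlehen und Überziehungskredite"),
]

# f(x, y): how often an English term of length x meets a translation of length y.
contingency = Counter((len(en), len(de)) for en, de in pairs)

for (x, y), f in sorted(contingency.items()):
    print(f"x(EN)={x:2d}  y(DE)={y:2d}  f(x,y)={f}")
```

A `Counter` returns 0 for combinations that never occur, which matches the empty cells of the table.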
Scatter graphs (also „scatter plots“) are visualisations of two variables [Harris, 2]. They are useful to convey information about the association between the two variables. In order to turn the contingency table into a scatterplot, we transfer the table into Cartesian coordinates. The x-axis represents the length of the English terms, the y-axis that of the translated language, which we want to call Y-Language. Each occurrence of an English term with the length x(EN) being translated into the language Y with the length y(Y) results in a point in the Cartesian plane. (x,y) tuples which do not occur, i.e. f(x,y) = 0, are left empty. Tuples which occur more than once would be overplotted using this method. In statistics, the method of jittering is applied in this case: it adds some noise to the values so that the points are placed near each other instead of on top of each other. I personally don’t like this visualisation workaround because of its imprecision.
From the perspective of data analysis, the question is how we can graph a third dimension. For this case Excel offers a 3D scatterplot chart. However, in our case with a range of 64 x-values and 64 y-values the result became quite unreadable. As our third dimension, the frequency, only ranges over 0 ≤ f(x,y) ≤ 14 and only very few values are beyond 4, an alternative solution could be to encode the frequency by different shades of the same color via conditional formatting of the cells. The result is depicted in the tabular scatterplot in the Google spreadsheet „Scattergraphs of Term Lengths across two languages“.
Another approach is to apply transparency to the dots which also results in having a darker color where dots are plotted on top of each other. This is the method I have used to produce the scattergraphs depicted in this article.
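Why transparency encodes frequency can be shown with a short calculation. When n dots of the same color and the same opacity are drawn on top of each other with ordinary alpha blending, the resulting opacity is 1 − (1 − alpha)^n. The sketch below assumes an opacity of 0.2 per dot, which is my own illustrative value and not the setting used for the graphs in this article.

```python
def stacked_opacity(count: int, alpha: float = 0.2) -> float:
    """Opacity of `count` identical dots blended on top of each other.

    alpha is the opacity of a single dot (0.2 is an assumed example value).
    """
    return 1.0 - (1.0 - alpha) ** count


# Frequencies in the data range from 0 to 14, most of them up to 4.
for f in (1, 2, 4, 14):
    print(f"f(x,y) = {f:2d}  ->  opacity {stacked_opacity(f):.2f}")
```

The opacity grows with every additional dot but never reaches full black, so high frequencies remain distinguishable, although less clearly than low ones.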
Term Lengths English – German
Let us first analyse the relation between English and German term lengths in Graph 6. Every dot in the graph shows one or more occurrences of an English term of length x and its German counterpart of length y. The darker the dot, the more frequently this tuple (EN,DE) occurred. The scatterplot is enhanced by a light grey diagonal which shows where an English term and its German translation have exactly the same number of characters. Not unexpectedly, this does not happen very often. Nonetheless, this diagonal seems to be some kind of attractor, as the dots are scattered along it. This pattern shows us the positive correlation between these two languages, expressing the tendency of German translations to get longer as the English terms do.
However, since more dots are above the diagonal than below it, we can read that the German term is more likely to be longer than the corresponding English one. And we can notice at the upper end of the DE axis that a lot of terms have the maximum length of 64 characters, and this happens even more frequently when the English terms have between 40 and 50 characters. The blue line is the regression line based on a linear model; within this context we should regard it as summarizing the correlation between our two variables.
The same scatterplot supplies a complete second point of view: we can also look at it from the German perspective. We only need to follow the values on the y-axis to the right to find out what happens to German term lengths when the terms are translated into English. By this we can see that no dot is placed beyond x = 51, meaning that no English term is longer than that, whatever the length of the German one. Of course, we could have concluded that from the histograms, but now we see it! And by looking at how many dots are placed to the left or right of the diagonal, we see that in general the English translations are shorter than the German terms. If they are longer, they are not exceedingly longer, as the dots to the right of the diagonal are close to it.
I consider this to be quite informative when being faced with the problem of reserving space for translations that we started off with. And I wonder if the conclusions could be applicable to language characteristics in general beyond the small and domain-specific set we use here. Anyway, let us continue looking closer at the relationships between the English and some other languages.
Term Lengths English – French
Right from the very first glance this plot looks very different from the EN – DE graph. We see that all dots are quite close to the diagonal. This tells us that French terms do not differ much in length from their English counterparts.
A few exceptions can be observed near x = 48. However, the correlation visually appears to be very strong, which is confirmed by a calculated correlation coefficient of r = 0.89. Also, the vertical and horizontal distances of the tuples from the diagonal look quite symmetrical, which can be interpreted to mean that in general the term length does not change dramatically when terms are translated.
Term lengths English – Japanese
The scatterplot of English and Japanese terms shows a different pattern with some similarities to the previous one. As in the French graph, the dots are scattered around the regression line. In this case, however, the regression line does not run parallel to the diagonal.
Except for one occurrence, all Japanese terms have fewer than 20 characters. The bulk of the terms is concentrated at 8 or fewer characters, as we can derive from the darker dots being placed at 4 < x < 24. No dot is above the diagonal, so we can conclude: whatever the English term, no Japanese translation will have more characters. However, we need to remind ourselves that the required space might be larger, as Asian characters appear to be wider than the average Latin character.
As the bulk of the data points is scattered along a line, we can assume a positive correlation: the longer the English term, the longer the Japanese term. This is confirmed by r = 0.73. However, from the plot we get the impression that the Japanese length grows at a much slower rate than in the non-Asian languages. This can be stated more precisely by the gradient b of the regression line, computed as b = 0.27.
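For readers who want to reproduce r and b themselves, here is a minimal sketch of the textbook formulas for the Pearson correlation coefficient and the least-squares regression line y = a + b·x. The length values in the example are hypothetical and were made up for illustration; they are not taken from the data set, so the printed numbers will not match the r = 0.73 and b = 0.27 reported above.

```python
from math import sqrt


def correlation_and_regression(xs: list[float], ys: list[float]) -> tuple[float, float, float]:
    """Return (r, b, a): Pearson r, slope b and intercept a of y = a + b*x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    r = sxy / sqrt(sxx * syy)
    b = sxy / sxx
    a = mean_y - b * mean_x
    return r, b, a


# Hypothetical term lengths in characters (illustration only, not the real data).
en_lengths = [8, 12, 20, 25, 32, 44]
jp_lengths = [3, 4, 6, 8, 9, 13]
r, b, a = correlation_and_regression(en_lengths, jp_lengths)
print(f"r = {r:.2f}, b = {b:.2f}, a = {a:.2f}")
```

A slope well below 1 means that each additional English character adds only a fraction of a character to the translation, which is the pattern described for Japanese.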
From Characters to Pixels
Until now, we have been comparing the number of characters in our 11 languages and found some typical and atypical characteristics by applying methods of visual analytics. In order to move on to the initial aim of estimating necessary screen space, we still have to convert the term-length into pixel width used by the characters of the different languages.
For simplicity’s sake, we assume that all languages based on the Latin alphabet (Germanic, Romanic plus Czech) use narrow letters like „i“ or „t“ and wide letters like „m“ in a similar proportion; a look at the primary data might be enough right now to justify this assumption. Cyrillic and both Japanese and Chinese seem to have characters that on average need more space. We want to quantify this by measuring the pixel width of the average character (i.e. letter or space) in each of the alphabets.
We are using a semi-manual method: Google Spreadsheet measures the width of a column in pixels (contrary to Excel, which uses a different unit, see the article „Pixel, Point und Zentimeter in Excel“), so that we can read the pixel width from the software instead of measuring it with a pixel-based tool like Photoshop. We pick a few terms from the different languages, set in Arial with a size of 11, which visually comes closest to Arial 9 pt in Excel. (No idea why the text size definitions differ between these two programs, but the important thing, the ratio between the average character widths, remains the same as everything is scaled up or down.) After calculating the pixel width of the average characters we get the following values:
We can roughly conclude that the average Cyrillic letter needs 10% more width than the average Latin letter, and that both Asian languages require 110% more width than Latin for their average character. This is quite a game changer and requires that we move on to adjusted scatterplots with the units converted to pixels.
As a result, Russian terms become longer than we have seen in number of characters. As for Japanese and Chinese, we can now estimate that some of the Asian terms are even wider in pixels than their English counterparts, particularly for short English terms, and we will see this confirmed in the scatterplots for Japanese and Chinese.
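The conversion from characters to pixels can be sketched as a simple multiplication. The width factors 1.0, 1.10 and 2.10 follow from the percentages stated above. The average Latin character width is an assumption on my side: I derive it from the figure of 406 px for 64 characters that is quoted further down in this article, which gives roughly 6.3 px per character. The names `WIDTH_FACTOR`, `SCRIPT` and `estimated_px` are hypothetical.

```python
# ASSUMPTION: average Latin character width, derived from "64 characters = 406 px".
LATIN_PX_PER_CHAR = 406 / 64

# Relative average character widths as stated in the text (Latin = 1.0).
WIDTH_FACTOR = {"latin": 1.0, "cyrillic": 1.10, "asian": 2.10}

SCRIPT = {
    "EN": "latin", "DE": "latin", "SP": "latin", "FR": "latin", "IT": "latin",
    "NL": "latin", "PT": "latin", "CZ": "latin",
    "RU": "cyrillic", "JP": "asian", "ZH": "asian",
}


def estimated_px(num_chars: float, lang: str) -> float:
    """Estimate the pixel width of a term from its number of characters."""
    return num_chars * LATIN_PX_PER_CHAR * WIDTH_FACTOR[SCRIPT[lang]]


# Average terms, using the mean character counts reported earlier.
for lang, mean in (("EN", 24.8), ("DE", 32.4), ("RU", 31.9), ("JP", 8.0), ("ZH", 7.1)):
    print(f"{lang}: {estimated_px(mean, lang):6.1f} px")
```

Such an estimate ignores that terms can mix scripts, for example Japanese terms containing Latin characters, so it tends to overestimate those cases.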
I have converted the scatterplots English – Language Y, which were based on the unit [number of characters] to the unit [pixel size], with their textsize being Arial 11 in Google spreadsheets. From comparing the different scatterplots we can conclude for needed screenspace for balance sheet accounting terms:
- The dominant factor is in general the length of the English term. The term length in any language correlates more or less linearly and positively with the English term length. The correlation is lowest with German (r = 0.695) and highest with the Romanic languages (for example Spanish with r = 0.906).
- Thus, if you have the choice: start your screen design with English terms.
Visual estimation of required additional pixel width
My idea now is to add a particular pixel amount on top of the English term length that will cover the additional length when the English term is replaced by the translated term. The precise value of this constant varies between the languages, and its absolute value depends on the text size. However, the ratios between the constants remain basically the same as long as all the terms in the different languages share the same text size. Recommendations I have heard, like „add 30% of length“, do not reflect the full range of values and don’t accomplish the task, particularly in the lower percentiles of the values.
The summand is rendered as a green line running parallel to the diagonal. All data points underneath this line will fit into the reserved screen space when the summand is added to the length of the corresponding English term. In the underlying computation of the scatterplot we can easily apply different values for this constant. Just by counting the dots above the green line we know exactly how many terms would be truncated, and we can decide whether we accept that or want to apply an even higher constant.
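The check against the green line can also be done without a chart. The following sketch lists the terms that would be truncated for a given summand and computes the smallest summand that covers every term. The pixel pairs are hypothetical values chosen for illustration, and the function names are my own.

```python
def truncated(pairs: list[tuple[float, float]], summand: float) -> list[tuple[float, float]]:
    """Return the (english_px, translated_px) pairs lying above the green line."""
    return [(en, tr) for en, tr in pairs if tr > en + summand]


def minimal_summand(pairs: list[tuple[float, float]]) -> float:
    """Smallest constant that lets every translation fit (never negative)."""
    return max(0.0, max(tr - en for en, tr in pairs))


# Hypothetical widths in pixels: (English term, translated term).
sample_px = [(100, 180), (150, 200), (60, 260), (200, 210)]

print(truncated(sample_px, 150))   # terms that would be cut off with +150 px
print(minimal_summand(sample_px))  # constant needed to avoid any truncation
```

Trying a few values for the summand shows directly how many truncated terms each choice costs, which is the same decision the green line supports visually.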
Let us have a look at the additional pixel widths required when our terms from the primary data set are substituting the English terms:
English > French / Italian / Spanish / Portuguese / Dutch
For the Romanic languages and Dutch, most of the translations will be wider than the English term, but in a proportional way. Adding 150 px on top of the pixel width of each English term will make sure that no term in these languages is cut off.
English > German
If you need to be on the safe side and no truncation can be applied to German terms, you should reserve space for all 64 characters, i.e. 406 px in Arial 9 pt. Alternatively, we could add 150 px (which is the width of about 21 „n“s in Arial 9 pt), at the cost that a very small part of the terms (those above the green line) will be truncated.
English > Russian
Russian term lengths correlate quite well with the English ones, but not as well as those of the Romanic languages. We need to add a larger screen space of 190 px to the English term and will then have only very few translations truncated.
English > Czech
To my surprise, an additional space of 150 px added to the English term length covers almost all of the Czech terms. Merely 11 out of the 386 data points lie above the green line.
English > Japanese
Though Asian characters are much wider than Latin characters, the very large majority of Japanese terms still require less space than their English counterparts. By adding 60 px (or, independent of font size, the width of 9 „n“s) on top you make sure that nothing will be truncated. <Remark> The 5 Japanese terms above the green line contain Latin characters, which makes them narrower than calculated for the graph </Remark>
English > Chinese
Basically the same applies as in Japanese, but with fewer outliers.
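The per-language constants from the sections above can be collected in a small lookup, sketched here in Python. The values are valid only for the text size used in this article and for this data set. The 60 px for Chinese is my assumption based on „basically the same applies as in Japanese“, and the 150 px for German accepts that a few terms get truncated. The names `EXTRA_PX` and `reserve_px` are hypothetical.

```python
# Additional pixels to reserve on top of the English term width,
# taken from the sections above (valid for this data set and text size only).
EXTRA_PX = {
    "FR": 150, "IT": 150, "SP": 150, "PT": 150, "NL": 150,
    "DE": 150,  # accepts truncation of a few very long terms
    "CZ": 150,  # covers almost all terms
    "RU": 190,
    "JP": 60,
    "ZH": 60,   # ASSUMPTION: same constant as Japanese
}


def reserve_px(english_px: float, target_langs: list[str]) -> float:
    """Width to reserve so that the widest expected translation still fits."""
    extra = max((EXTRA_PX[lang] for lang in target_langs), default=0)
    return english_px + extra


print(reserve_px(200, ["FR", "NL"]))        # Romanic languages and Dutch
print(reserve_px(200, ["FR", "RU", "JP"]))  # Russian dominates the mix
```

When a layout has to serve several languages at once, the largest constant among the target languages decides the reserved width.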
Some concluding thoughts on the topic, which might include some take-aways
- If you have got all the translations at hand and use scatterplots, then you can precisely compute how much space you need to keep terms from being cut off.
- By applying the method shown above you can literally see how many (and even which!) terms will be truncated if you need to reserve less screen space.
- Correlation rules: the longer the English term, the longer the translated term will be, as a tendency but not in all cases.
- If you do the layout using English terms, chances are high that the translated term will be longer but not exceedingly longer. This holds at least for translations into the Romanic languages and Dutch, and surely for Japanese and Chinese.
- Different alphabets are an important factor for the required screen space (based on using the same font and text size).
- Term length can be measured in two different units: number of characters or pixels. The number of characters keeps you independent of font and font size; using pixel units ties you to a particular font size. However, when calculating screen space for layouts you might need both units.
Of course, the findings of this article cannot be generalized beyond the domain of balance sheet accounting. However, they can provide some indicators for the space required by different languages.
I have shown some methods to calculate the required space very precisely once most of the terms within one domain are known. This knowledge should be preserved, analysed and documented so that we can apply it the next time a similar situation occurs. If we examine terms in different industries, maybe we can confirm some of the basic associations we have found above. Future work?
I wonder how companies which are localizing their software are dealing with this problem. Does anyone know more about this?
[1] Article „Histogram and Frequency Polygon“ in Harris, Robert L.: Information Graphics – A Comprehensive Illustrated Reference, New York 1999, p. 187 ff.
[2] Article „Correlation Graph“ in Harris, 1999, p. 110 ff.
<Hint>This is THE REFERENCE for Charts and Diagrams. Needs to be on the shelf of every Information Designer.</Hint>
I want to recommend two other books on statistics:
Field, Andy et al.: Discovering Statistics using R, Los Angeles, 2nd edition 2013. A work of epic dimensions. <Warning> This book could be used to kill human beings in two ways: either by throwing the book with a weight of 2.340 kg at them. Or by letting them learn Statistics via R on 956 pages. I think the second death is worse. </Warning>
Hengst, Martin: Einführung in die mathematische Statistik und ihre Anwendung. Mannheim, 1967. <Personal remark> This is a book in the academic series „BI Hochschultaschenbücher“. Very compact statistical knowledge at university level. I bought it back in 1971, but did not read it at that time. However, when I started discovering statistics four years ago, I was glad about its mathematical and roots-oriented approach. No Excel, no R, no computer, not even desktop publishing. At that time all diagrams had to be drawn by hand. It seems a bit old-fashioned by now, but I like it just for this. There does not seem to be an updated edition, just 10 copies from 1967 available at Amazon. </Personal Remark>