Gem #149 : Asserting the truth, but (possibly) not the whole truth

Let's get started...

In the beginning was created Ada. It did not have any assertions. Then came GNAT, which introduced pragma Assert. The ARG saw that it was good, and adopted it in Ada 2005. Then came GNAT again, which introduced pragma Precondition and pragma Postcondition. The ARG saw that they were good too, and adopted them as aspects in Ada 2012. The ARG even tried to beat GNAT at this game, and introduced at the same time aspects for type predicates (see Gems #146 and #147) and type invariants (see Gem #148), which are other forms of assertions. Then came GNAT again, introducing pragmas Assume, Assert_And_Cut, and Loop_Invariant, and aspect Contract_Cases, yet other forms of assertions.

So now the Ada programmer has a rich set of assertions to state control-relevant properties (Assert, Pre, Post, Loop_Invariant, Assume, Assert_And_Cut) and data-relevant properties (Static_Predicate, Dynamic_Predicate, Type_Invariant).

How does one state which assertions get executed? And how does one differentiate between different executables, say, between one created for debugging/testing, and one created for production?

GNAT provides a switch -gnata that enables all assertions: pragma Assert of course, but also all the newer forms of assertions presented above. So each unit can be independently compiled with or without assertions. But that's not always adequate.

Let's take the example of writing a library. We want to use preconditions to prevent the library from being called in an invalid context (defensive programming), and postconditions plus type predicates to help with debugging and maintenance of the library (assertion-based verification). Here is the code:

package Library is
   type Status is (None, Acquired, Released);

   type Resource is record
      Id   : Integer;
      Stat : Status;
   end record
     with Dynamic_Predicate =>
       (if Resource.Id = 0 then Resource.Stat = None
        else Resource.Stat /= None);

   No_Resource : constant Resource := Resource'(0, None);

   procedure Get (R : in out Resource; Id : Integer) with
     Pre  => R.Stat = None,
     Post => R.Stat = Acquired;

   procedure Free (R : in out Resource) with
     Post => (if R.Stat'Old = Acquired then R.Stat = Released);
end Library;

package body Library is
   procedure Get (R : in out Resource; Id : Integer) is
   begin
      R.Stat := Acquired;
      R.Id   := Id;
   end Get;

   procedure Free (R : in out Resource) is
   begin
      if R.Stat /= Acquired then
         return;
      end if;
      R.Stat := Released;
   end Free;
end Library;

When this code is compiled with the switch -gnata, each call to Get incurs four run-time assertions (and calls to Free have three):

  • a precondition check on subprogram entry
  • a postcondition check on subprogram exit
  • a predicate check for parameter R on subprogram entry
  • a predicate check for parameter R on subprogram exit

That's fine during testing and debugging (when we use -gnata), but we'd like the production code to only contain run-time assertions for the preconditions, to catch misuse of the library in the actual product, while avoiding the overhead of the other checks.

Ada 2012 provides pragma Assertion_Policy for that purpose. This pragma can take the name of an assertion aspect/pragma as first argument, and the desired policy for that aspect as second argument. To enforce checking of preconditions even when -gnata is not used, one only has to include the following line at the start of library.ads:

pragma Assertion_Policy (Pre => Check);
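
The same pragma accepts several aspect/policy pairs at once. As a rough sketch only, a unit whose checking policy is meant to be fixed in the source, independent of compiler switches, could spell out every aspect used in Library. This assumes that a policy written in the source takes precedence over -gnata for that unit, which is worth confirming in the GNAT documentation before relying on it:

```ada
--  Sketch: keep precondition checks, drop the other checks used in Library.
--  Assumption: a source-level policy applies to this unit whether or not
--  -gnata is passed, so this form does not suit a unit that should still
--  get full checking in test builds.
pragma Assertion_Policy (Pre               => Check,
                         Post              => Ignore,
                         Dynamic_Predicate => Ignore);
```

The single-aspect form shown above is the less intrusive choice when test builds should still enable everything through -gnata.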

Now, any misuse of the library by client code will be detected, no matter how the library is compiled. Take for example a program that fails to release the resource between two calls to Get:

with Library; use Library;
procedure Client is
   R : Resource := No_Resource;
begin
   Get (R, 1);
   Get (R, 2);  -- incorrect
end Client;

This code (and the library code) can now be compiled without -gnata:

$ gnatmake client.adb
gcc -c client.adb
gcc -c library.adb
gnatbind -x client.ali
gnatlink client.ali

And it still raises an error at run time:

$ ./client
raised SYSTEM.ASSERTIONS.ASSERT_FAILURE : failed precondition from library.ads:16
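
For comparison, here is a sketch of a client that satisfies the precondition on both calls. Note that with the library exactly as written above, Free leaves R.Stat equal to Released, which still does not satisfy the precondition of Get. The sketch therefore assumes that the client resets the object to No_Resource before reusing it; that convention is an assumption made for this example and is not stated by the library itself.

```ada
with Library; use Library;

procedure Client is
   R : Resource := No_Resource;
begin
   Get (R, 1);
   Free (R);
   --  Assumption: reset before reuse, since Free leaves R.Stat = Released
   --  and Get requires R.Stat = None.
   R := No_Resource;
   Get (R, 2);
   Free (R);
end Client;
```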

For more information on pragma Assertion_Policy, or the new assertion pragmas/aspects supported by GNAT, see the GNAT Pro Reference Manual.

And as Tony Hoare puts it: "Assert early and assert often!"

Yannick Moy
AdaCore

Yannick Moy’s work focuses on software source code analysis, mostly to detect bugs or verify safety/security properties. Yannick previously worked for PolySpace (now The MathWorks) where he started the project C++ Verifier. He then joined INRIA Research Labs/Orange Labs in France to carry out a PhD on automatic modular static safety checking for C programs. Yannick joined AdaCore in 2009, after a short internship at Microsoft Research.

Yannick holds an engineering degree from the Ecole Polytechnique, an MSc from Stanford University and a PhD from Université Paris-Sud. He is a Siebel Scholar.

Gem #148 : Su(per)btypes in Ada 2012 - Part 3

In the previous two Gems, we saw how aspects Static_Predicate and Dynamic_Predicate can be used to state properties of objects that should be respected at all times. This third and final Gem in the series is concerned with an aspect called Type_Invariant.

The Type_Invariant aspect can be used with private types, to define a property that all objects of the type should respect outside of the package where the type is declared. Take for example a type Communication storing the messages between various parties, based on the Message type used in the previous Gem:

package Communications is
   type Message_Arr is array (Integer range <>) of Message;
   type Communication (Num : Positive) is private;
private
   type Communication (Num : Positive) is record
      Msgs : Message_Arr (1 .. Num);
   end record;
end Communications;

To state that messages should be ordered by date of reception, we can add the aspect to the full type:

type Communication (Num : Positive) is record
   Msgs : Message_Arr (1 .. Num);
end record with
  Type_Invariant =>
    (for all Idx in 1 .. Communication.Num-1 =>
      Communication.Msgs(Idx).Received <= Communication.Msgs(Idx+1).Received);

The compiler will insert run-time checks to ensure that this property holds at prescribed locations in the code:

  • at object initialization (including by default!)
  • on conversions to the type
  • when returning an object from a public function defined in the type's package
  • on out and in out parameters, when returning from a public procedure of the type's package

For example, consider the following incorrect code that fails to initialize Com to a correct value satisfying the invariant:

Com : Communication (2);  -- incorrect

Compiling it with assertions and running it leads to the following error:

raised SYSTEM.ASSERTIONS.ASSERT_FAILURE : failed invariant from communications.ads:16

But if we give the object an explicit value, through a creation function Create defined in unit Communications, then the object declaration is elaborated without errors:

Com : Communication (2) := Create (A);

Inside the Create function, the initialization of Coms must respect the invariant, but after that, the invariant could be violated between the time Coms is declared, and the time it is returned.

function Create (A : Message_Arr) return Communication is
   Coms : Communication := (Num => A'Length, Msgs => A);
begin
   -- statements before the return might violate the invariant
   return Coms;
end Create;

Ada requires that the type invariant be checked on every part of a parameter that has type Communication, where a part can be a component of a record, or an element of an array, or any such combination. For example, it is checked on every element of the array returned by Create_N or potentially modified by Update_N:

type Communication_Arr is array (Integer range <>) of Communication;
function Create_N return Communication_Arr;
procedure Update_N (A : in out Communication_Arr);

Importantly, the invariant is not checked on subprograms declared in the private part or in the package body. These subprograms are internal operations, and should be callable on objects whose invariant does not hold. Likewise, the invariant is not checked on parameters of mode in, for example on query functions used in the definition of the type invariant itself. This is fortunate, since otherwise this would easily cause infinite loops!
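
The distinction between visible and internal operations can be sketched with a small self-contained package. The package Accounts, its components, and its subprograms are hypothetical names invented for this illustration; they are not part of the Communications example.

```ada
package Accounts is
   type Account is private;

   --  Visible operation: the invariant of Acc is checked on return
   procedure Move (Acc : in out Account; Amount : Natural);

private
   type Account is record
      Checking : Natural := 100;
      Savings  : Natural := 0;
   end record
     with Type_Invariant => Account.Checking + Account.Savings = 100;

   --  Internal operations: no invariant check on return
   procedure Withdraw (Acc : in out Account; Amount : Natural);
   procedure Deposit  (Acc : in out Account; Amount : Natural);
end Accounts;

package body Accounts is
   procedure Withdraw (Acc : in out Account; Amount : Natural) is
   begin
      Acc.Checking := Acc.Checking - Amount;  --  invariant broken here
   end Withdraw;

   procedure Deposit (Acc : in out Account; Amount : Natural) is
   begin
      Acc.Savings := Acc.Savings + Amount;    --  invariant restored here
   end Deposit;

   procedure Move (Acc : in out Account; Amount : Natural) is
   begin
      Withdraw (Acc, Amount);
      Deposit (Acc, Amount);
   end Move;
end Accounts;
```

If Withdraw were declared in the visible part instead, a call to it from client code would be expected to fail the invariant check on return, since it leaves the two components summing to less than 100.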

As a side note, it's worth mentioning that GNAT also provides an aspect with the name Invariant, which is a synonym for the Type_Invariant aspect (and implemented before Type_Invariant appeared in Ada 2012).

This Gem ends the series of three Gems on su(per)btypes in Ada. Together with Static_Predicate and Dynamic_Predicate, Type_Invariant provides new ways to state properties of your data, both in new and existing programs, so try them out!

Yannick Moy
AdaCore

Gem #147 : Su(per)btypes in Ada 2012 - Part 2

Let's get started...

The previous Gem in this series showed how the aspect Static_Predicate can be used to state properties of scalar objects that should be respected at all times. This Gem is concerned with the Dynamic_Predicate aspect, which can be used on all type and subtype declarations (not just scalar ones).

Consider for example a type Message encoding the dates when a message was sent and received, where dates are represented by strings, such as "1789-07-14" for the fourteenth of July 1789:

type Day is new String (1 .. 10);

type Message is record
   Sent     : Day;
   Received : Day;
end record;

To state that the date a message is received should never be earlier than the date it was sent, we can write:

type Message is record
   Sent     : Day;
   Received : Day;
end record with
  Dynamic_Predicate => Message.Sent <= Message.Received;

Note that the type name itself is used as a prefix of the components named in the predicate. In this context the name of the type denotes what Ada calls the current instance of the type, which at run time will denote the actual object the predicate is applied to.

In contrast to Static_Predicate, the compiler cannot determine in general if a Dynamic_Predicate will fail, so it inserts run-time checks at certain required locations in the code:

  • when assigning to a variable of the subtype
  • when passing an input parameter of the subtype
  • when returning an output parameter to an object of the subtype
  • when converting a value to the subtype

For example, on the following incorrect code:

M : Message := (Received => "1776-07-04", Sent => "1783-09-03");  --  incorrect

Compiling it with assertions and running it leads to the following error:

raised SYSTEM.ASSERTIONS.ASSERT_FAILURE : Dynamic_Predicate failed at main.adb:3

If the values of the Sent and Received components are corrected to reflect the actual event ordering of the proclamation of the Independence of the United States and the date of the treaty of Paris ending the American Revolutionary War, then the generated code executes without errors.

Beware that no run-time checks are inserted when assigning to individual components, so the predicate can be silently violated between assignments and calls. For example, suppose the code above assigns each component of M separately, with values for Received and Sent that are appropriately ordered:

   M : Message;  --  incorrect
begin
   M.Received := "1783-09-03";  --  incorrect
   M.Sent := "1776-07-04";      --  predicate is correct here

This code does not lead to a run-time failure. Suppose, however, that we pass the message, before it is completely initialized, to some procedure Process taking it as an input parameter:

procedure Process (M : Message);

Compiling the resulting code with assertions and running it again leads to an error:

raised SYSTEM.ASSERTIONS.ASSERT_FAILURE : Dynamic_Predicate failed at main.adb:7
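
One way to avoid this situation is to give the object its value in a single aggregate assignment, so that the predicate is checked on the complete value. The following is only a sketch: the body of Process is invented for illustration, and the types are declared locally to keep the example self-contained.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Main is
   type Day is new String (1 .. 10);

   type Message is record
      Sent     : Day;
      Received : Day;
   end record with
     Dynamic_Predicate => Message.Sent <= Message.Received;

   M : Message;  --  not yet meaningful: no component has been set

   --  Hypothetical consumer of a complete message
   procedure Process (M : Message) is
   begin
      Put_Line ("Sent     " & String (M.Sent));
      Put_Line ("Received " & String (M.Received));
   end Process;
begin
   --  One aggregate assignment: the predicate is checked on the whole value
   M := (Sent => "1776-07-04", Received => "1783-09-03");
   Process (M);
end Main;
```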

Note that Dynamic_Predicate is more flexible than Static_Predicate: it can be applied to more forms of types and more general predicate expressions. For example, the mod operator is not allowed outside a static expression in a Static_Predicate, so the type of odd numbers must be defined with a Dynamic_Predicate:

subtype Odd is Integer with Dynamic_Predicate => Odd mod 2 = 1;

Likewise, a user function can be called in a Dynamic_Predicate, but not in a Static_Predicate.
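
A predicate also takes part in membership tests, which makes a subtype such as Odd usable as a filter. A minimal sketch follows (the procedure name Show_Odd is made up for this example); the membership test evaluates the predicate as part of its ordinary semantics, so it does not depend on assertion checks being enabled.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Odd is
   subtype Odd is Integer with Dynamic_Predicate => Odd mod 2 = 1;
begin
   for I in -3 .. 3 loop
      --  "I in Odd" is True exactly when the predicate holds for I.
      --  In Ada, mod takes the sign of its right operand, so negative
      --  odd numbers also satisfy "mod 2 = 1".
      if I in Odd then
         Put_Line (Integer'Image (I) & " is odd");
      end if;
   end loop;
end Show_Odd;
```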

GNAT conveniently provides an aspect Predicate that can be used anywhere a Dynamic_Predicate is allowed, and analyzes it as a Static_Predicate when possible.

In the next and final Gem in this series on type and subtype contracts we'll look at a related aspect called Type_Invariant.

Yannick Moy
AdaCore

Gem #146 : Su(per)btypes in Ada 2012 - Part 1

Let's get started...

Ada 2012 is full of features for specifying a rich set of type properties. In this series of three Gems, we describe three aspects that can be used to state invariant properties of types and subtypes. This first Gem is concerned with the Static_Predicate aspect.

Static_Predicate can be specified on scalar types and subtype definitions to state a property that all objects of the subtype must respect at all times. Take for example a type Day representing the days of the week:

type Day is (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);

To state that T_Day is the (sub)type of days whose name starts with a 'T', we can write:

type T_Day is new Day with Static_Predicate => T_Day in Tuesday | Thursday;

or

subtype T_Day is Day with Static_Predicate => T_Day in Tuesday | Thursday;

Now the compiler will warn about a program that assigns a value statically known to be different from Tuesday or Thursday to a T_Day object.

We'll proceed with using the second definition above. For example, on this incorrect code:

D : T_Day := Day'First; -- Incorrect

GNAT generates the following warning at compile time:

>>> warning: static expression fails static predicate check on "T_Day"

The compiler also checks the completeness of case expressions and case statements involving T_Day arguments. For example, on this code:

   case D is
      when Tuesday => ...
      when Friday => ...  -- Incorrect
   end case;

GNAT generates the following errors at compile time:

>>> missing case value: "Thursday"
>>> static predicate on "T_Day" excludes value "Friday"

If Friday is replaced by the correct value Thursday, then the code compiles quietly.
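
For reference, a complete program with the corrected case statement might look like the sketch below. The procedure name and the printed messages are invented for this example.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Show_Case is
   type Day is
     (Monday, Tuesday, Wednesday, Thursday, Friday, Saturday, Sunday);

   subtype T_Day is Day with Static_Predicate => T_Day in Tuesday | Thursday;

   D : T_Day := Thursday;
begin
   --  Only the values allowed by the static predicate need to be covered,
   --  so no "when others" alternative is required.
   case D is
      when Tuesday  => Put_Line ("Tuesday");
      when Thursday => Put_Line ("Thursday");
   end case;
end Show_Case;
```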

Finally, the compiler generates run-time checks for any erroneous write of a day other than Tuesday or Thursday in an object of type T_Day, which makes it easy to detect violations of the predicate of a type as soon as it occurs! Note that to enable run-time checking of Static_Predicate (and other kinds of assertions specified by aspects) it's necessary to compile with the switch -gnata (or else enable assertion checking with the pragma Assertion_Policy).

For example, suppose we have a procedure Next that advances its argument to the next day, and we want to define a similar procedure, Next_T, that advances its argument of subtype T_Day. Here's the definition of procedure Next:

procedure Next (D : in out Day) is
begin
   if D = Sunday then
      D := Monday;
   else
      D := Day'Succ (D);
   end if;
end Next;

Following is a failed attempt at defining Next_T:

procedure Next_T (D : in out T_Day) is
begin
   Next (D); -- Incorrect
   while D not in T_Day loop
      Next (D);
   end loop;
end Next_T;

Let's add a test of this code:

with Days; use Days;
procedure Main is
   D : T_Day := Tuesday;
begin
   Next_T (D);
end Main;

When this code is compiled with assertions enabled (-gnata) and run, it issues a run-time error:

raised SYSTEM.ASSERTIONS.ASSERT_FAILURE : Static_Predicate failed at days.adb:3

This points to the first line where Next is called in Next_T. Indeed, on entry to Next_T, the value of D is Tuesday, so Next returns Wednesday, which does not satisfy the Static_Predicate of T_Day, but is assigned to a T_Day, hence triggering a run-time error. The correct version of Next_T uses a temporary variable of type T_Day'Base, which strips off all constraints from T_Day, including the predicate if present:

procedure Next_T (D : in out T_Day) is
   Tmp : T_Day'Base := D;
begin
   Next (Tmp);
   while Tmp not in T_Day loop
      Next (Tmp);
   end loop;
   D := Tmp;
end Next_T;

In the next Gem in this series we'll see how to use a related aspect called Dynamic_Predicate.

Yannick Moy
AdaCore

Gem #145: Ada Quiz 3 - Statements

Let's get started...

Ada is an imperative programming language, where the sequentially executed statement is a building block of the language (together with declarations). This Gem presents ten short questions on Ada statements. Try to answer them without using the compiler.

Q1 - Is there a compilation error?

if A == 0 then
   Put_Line ("A is 0");
end if;

 

Q2 - Is there a compilation error?

if A := 0 then
   Put_Line ("A has been assigned the value zero");
end if;

 

Q3 - Is there a compilation error?

declare
   A : Integer := Integer'Value (Get_Line);
begin
   case A is
      when 1 .. 9 =>
         Put_Line ("Simple digit");
      when 10 .. Integer'Last =>
         Put_Line ("Long positive");
      when Integer'First .. -1 =>
         Put_Line ("Negative");
   end case;
end;

 

Q4 - Is there a compilation error?

declare
   A : Integer := Integer'Value (Get_Line);
begin
   case A is
      when Positive =>
         Put_Line ("Positive");
      when Natural =>
         Put_Line ("Natural");
      when others =>
         Put_Line ("Other");
  end case;
end;

 

Q5 - Is there a compilation error?

declare
   A : Float :=  10.0;
begin
  case A is
     when 1.0 .. Float'Last =>
        Put_Line ("Positive");
     when Float'First .. -1.0 =>
        Put_Line ("Negative");
     when others =>
        Put_Line ("others");
  end case;
end;

 

Q6 - Is there a compilation error?

for Index in 0 .. 10 loop
   Index := 10;
end loop;

 

Q7 - What is the output of this code?

Put_Line ("Before the loop");
for Index in 10 .. 0 loop
   Put_Line (Integer'Image (Index));
end loop;
Put_Line ("After the loop");

 

Q8 - Is there a compilation error?

if A != 0 then
   Put_Line ("A is not 0");
end if;

 

Q9 - What is the output of this code?

declare
   Index : Integer := 20;
begin
   for Index in 1 .. 5 loop
      Put_Line (Integer'Image(Index));
   end loop;
   Put_Line (Integer'Image(Index));
end;

 

Q10 - What is the output of this code?

declare
  X : Integer := 2;
begin
   for I in 1 .. X loop
      X := 10;
      Put_Line ("One loop iteration");
   end loop;
end;

Answers:

A1 - Compilation error: The Ada equality symbol is "=", not "==".

A2 - Compilation error: Assignment is not an operator in Ada. Therefore it can never be used in an expression or Boolean condition.

A3 - Compilation error: The covered intervals are Integer'First to -1, 1 to 9, and 10 to Integer'Last, so zero is missing. All values of the subtype of the case expression must be covered by the case statement alternatives, so the compiler will complain about the missing value.

A4 - Compilation error: Positive and Natural are subtypes defined in the predefined package Standard as follows:

subtype Natural is Integer range 0 .. Integer'Last;

subtype Positive is Integer range 1 .. Integer'Last;

Their ranges overlap, so they cannot be used together in a case statement, since each value covered by the choices in a case statement must occur only once.

A5 - Compilation error: Float is not a discrete type, so it cannot be used for the type of the expression of a case statement.

Summarizing the answers to questions 3, 4, and 5: in a case statement, each value belonging to the subtype of the expression, which must be a discrete, static subtype, must be covered once and only once by the case alternatives.
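
As an illustration of these rules, here is one possible repair of the case statement from Q3, in which every Integer value is covered exactly once. The procedure name Classify and the "Zero" message are additions made for this sketch.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Classify is
   A : constant Integer := Integer'Value (Get_Line);
begin
   case A is
      when Integer'First .. -1 =>
         Put_Line ("Negative");
      when 0 =>
         Put_Line ("Zero");
      when 1 .. 9 =>
         Put_Line ("Simple digit");
      when 10 .. Integer'Last =>
         Put_Line ("Long positive");
   end case;
end Classify;
```

A "when others" alternative would also make the statement complete, at the cost of no longer getting a compile-time error if a range is forgotten.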

A6 - Compilation error: In an Ada for loop, the loop_parameter is a constant: it cannot be updated within the sequence of statements of the loop.

A7 - The output is:

	Before the loop
	After the loop 

Nothing is printed during execution of the loop itself, because the range 10 .. 0 is empty, so the loop doesn't loop!

The correct way to get the 11 numbers printed from 10 to 0 is to use a reverse loop:

for Index in reverse 0 .. 10 loop
   Put_Line (Integer'Image (Index));
end loop;

A8 - Compilation error: The Ada inequality symbol is "/=", not "!=".

A9 - The output is:

  1
  2
  3
  4
  5
  20

It's unnecessary to declare Index before using it as a for loop index. The for loop effectively declares its own loop variable, which will hide any outer object with the same name.

A10 - The output is:

One loop iteration
One loop iteration

The range of the loop parameter is determined once, at the start of the loop. The modification of X after that point has no effect on the number of loop iterations.

 

Valentine Reboul
AdaCore

Valentine Reboul joined AdaCore in 2012 after 3 years of experience in critical systems (Air Traffic Flow Management and Railway automation solutions). She now participates in the "Qualifying Machine" research project and is involved in training sessions given about Ada Language. She holds an engineering degree from the Ecole Nationale Supérieure d'Informatique et de Mathématiques Appliquées (Grenoble, France).

Gem #144: A Bit of Bytes: Characters and Encoding Schemes

Let's get started...

This Gem starts with a problem. As a French native, I often manipulate text files that contain accented letters (those accents, by the way, were often introduced as a shorthand to replace letters in words, to save paper when it was still an expensive commodity). Unfortunately, depending on how the file was created, my programs do not necessarily see the same byte contents (which depends on the encoding and the character set of the file), and, if I just try to display them on the screen (either in a text console, or in a graphical window), the output might not read like what I initially entered.

Glyphs

At this point, let's introduce the notion of glyphs. These are the visual representations of characters. For instance, I want "e-acute" to look like an "e" with a small acute accent above it. This visual representation is the final goal in a lot of applications, since that's what the user wants to see. In other applications, however, the glyphs are irrelevant. For instance, a compiler does not care how characters are displayed on your screen. It needs to know how to split a sequence of characters into words, but that's about it. It assumes your console, where error messages are displayed, will display the same glyphs you had in your source file when given the same bytes as the source file itself.

A text file does not embed the description of what its representation looks like. Instead, it is composed of bytes, which are combined in certain ways (sometimes called character encoding schemes) to make up code points. These code points are then matched to a specific character using a character set. Finally, the font determines how the character should be represented as a glyph.

A character's exact representation (its glyph) really depends on the font you are using, since a "lower case a" might have widely different aspects that depend on the font. This is outside the scope of this Gem, though.

In general, your application is not concerned with the mapping of characters to glyphs via the font. This is all taken care of by either the text console, or the GUI toolkit you are using. Your application will often let the user choose her preferred font, and then make sure to pass valid characters. The toolkit does the complex work of representing the characters. For example, this work is the role of the Pango toolkit (accessible from GtkAda).

Character Sets

A repertoire is a set of generally related characters, for instance the alphabets used to spell English or Russian words.

A character set is a mapping from a repertoire to a set of integers called code points. A given character, as we shall see, might exist in several different character sets with different code points.

Most of the standard character sets (sometimes abbreviated as charsets) are specific to one language. For instance, there exist ISO-8859-1 (also known as Latin-1) and ISO-8859-15, which are used for West European languages; we also have ISO-8859-5 and KOI8-R, which are different, but both used for Russian; Windows introduced a number of code pages, which are in fact character sets specific to that platform; Japanese texts often use ISO-2022-JP, whereas Chinese has several standard sets.

Let's take the simplest of them all, the ASCII charset. Most developers are familiar with it. For instance, in this set the code point 65 is associated with the letter upper-case-A. This set includes 128 characters, 33 of which are control characters with no visual representation. It contains no accented letters, but is basically appropriate for representing English texts.

In a lot of Western European languages, like French, ASCII was not sufficient, so ISO-8859-1 was built on top of it. The first 128 characters are the same, so code point 65 is still upper-case-A. But it also adds 128 extra characters, for instance 233 is lower-case-e-with-acute. See the Wikipedia page on ISO-8859-1 for more details.

Another example is ISO-8859-5, for Russian text, which is incompatible with ISO-8859-1, although it is also based on ASCII. So 65 is still upper-case-A, but this time 233 is cyrillic-small-letter-shcha and lower-case-e-with-acute does not exist.

As a result, if an application is reading an ISO-8859-5 encoded file, but believes it is ISO-8859-1, it will display an invalid glyph for most of the Russian letters, obviously making the text unreadable for the user.

In most applications (for instance, the GPS IDE), there is a way to specify which character set the application should expect the files to be encoded in by default, and a way to override the default encoding for specific files.

There exists one character set that includes all characters that exist in all the other character sets (or at least is meant to), and this is Unicode (somewhat akin to ISO-10646). It includes thousands upon thousands of characters (and more are added at each revision), while avoiding duplicates. For compatibility with a lot of existing applications, the first 256 characters are the same as in ISO-8859-1, so upper-case-A is still 65, and lower-case-e-with-acute is still 233. But now cyrillic-small-letter-shcha is 1097.

Nowadays, a lot of applications (and even programming languages) will systematically use Unicode internally. For instance, the GTK+ graphic toolkit only manipulates Unicode for internal strings, and so does Python 3.x. So whenever a file is read from disk by GPS, it is first converted from its native character set to Unicode, and then the rest of the application no longer has to care about character sets.

Given the size of Unicode, there are few (if any) fonts that can represent the whole set of characters, but that's not an issue in general since most applications do not need to represent Egyptian hieroglyphs...

Another major part of the Unicode standard is a set of tables to find the properties of various characters: which ones should be considered as white space, how to convert from lower to upper case, which letters are part of words, etc. This knowledge is often hard-coded in our applications and often involves a major change when an application decides to use Unicode internally.

Character Encoding Schemes

We now know how to represent characters as a combination of code points and a character set. But we often need to store those characters in files, which only contain bytes. That seems relatively easy when the code point is less than 256, but becomes much less obvious for other code points, like the 1097 we saw earlier.

In practice, this issue is solved in a number of ways. Encoding schemes such as the Japanese ISO-2022-JP use a notion of plane shift: special bytes indicate that from now on the bytes should be interpreted differently, until the next plane shift. Decoding and encoding therefore requires knowledge of the current state.

Unicode itself defines three different encoding schemes (with their variants), which are known as UTF-8, UTF-16, and UTF-32. The number in each name indicates the size in bits of the basic unit used to encode characters. In UTF-32, each character occupies four bytes, which allows the whole set of Unicode characters to be represented. Decoding and encoding are therefore trivial, but there is a major waste of space associated with UTF-32.

In UTF-16, most characters are encoded in two bytes, which is enough for the characters of most languages in current use. Other characters are for specific usage, like Egyptian hieroglyphs. For code points that do not fit in two bytes, Unicode reserves special 16-bit values (the surrogates, used in pairs) that play a role similar to the plane shifts we described earlier. Thus, there is much less wasted space, but decoding and encoding becomes a bit more complex.

The above two encoding schemes are not backward compatible: an application that was written before Unicode and that only knew about ASCII or ISO-8859-1 will not understand the input strings properly.

For this reason, and to save even more space, Unicode also defines the UTF-8 encoding. ASCII characters are still represented as before, using a single byte. Characters greater than 127 are encoded as a sequence of several bytes (and it is guaranteed that none of the bytes in such a sequence is part of ASCII).

Properly manipulating a UTF-8 string requires the use of specialized routines (since moving forward one character means moving forward 1 to 4 bytes). However, a casual application can, for instance, skip to the next white space character as it did before by moving forward one byte at a time and stopping when it sees 32 (a space) or 10 (a line feed). This property can often be used by applications that do not need to represent the characters, like the example of the compiler we mentioned at the beginning.
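
To make the byte-versus-character distinction concrete, here is a small sketch that counts the characters in a UTF-8 string by skipping continuation bytes (those of the form 2#10xxxxxx#, that is, in the range 16#80# .. 16#BF#). The names Count_UTF8 and Length_In_Code_Points are invented for this example, and the function assumes its input is valid UTF-8; a real application would rather rely on a dedicated library.

```ada
with Ada.Text_IO; use Ada.Text_IO;

procedure Count_UTF8 is

   --  Count code points by counting every byte that is not a
   --  continuation byte. Assumes S holds well-formed UTF-8.
   function Length_In_Code_Points (S : String) return Natural is
      Count : Natural := 0;
   begin
      for C of S loop
         if Character'Pos (C) not in 16#80# .. 16#BF# then
            Count := Count + 1;
         end if;
      end loop;
      return Count;
   end Length_In_Code_Points;

   --  e-acute (code point 233) is the two bytes 16#C3# 16#A9# in UTF-8
   EAcute : constant String :=
     Character'Val (16#C3#) & Character'Val (16#A9#);

   --  The French word for "summer": e-acute, t, e-acute
   Word : constant String := EAcute & "t" & EAcute;
begin
   Put_Line ("Bytes:     " & Natural'Image (Word'Length));
   Put_Line ("Characters:" & Natural'Image (Length_In_Code_Points (Word)));
end Count_UTF8;
```

Here the string occupies five bytes but holds three characters, which is exactly the gap that byte-oriented code has to be aware of.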

Although the notions of character sets and character encoding schemes are orthogonal, often these notions are conflated. For instance, when someone mentions ISO-8859-1, it usually means the character set as well as its standard representation, where each character is represented as a single byte. Likewise, someone talking about UTF-8 will typically mean the Unicode character set together with the UTF-8 character encoding scheme.

Conversions

We now have almost all of the pieces in place, except for the conversion between character sets. In theory, it is enough to decode the input stream using the proper character encoding scheme, then find the mapping for the code points from the origin to the target character set, and finally use the target encoding scheme to represent the characters as bytes again.

When a character has no mapping into the target character set (for instance the e-acute in the Russian ISO-8859-5), the application needs to decide whether to raise an error, ignore the character, or find a transliteration (for example, using e' for e-acute).

This is obviously tedious, and requires the use of big lookup tables for all the character sets your application needs to support.
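To make the three steps concrete, here is a hand-written sketch for one of the simplest possible cases: converting ISO-8859-1 to UTF-8. It relies on the fact that the first 256 Unicode code points coincide with ISO-8859-1, so the mapping step needs no lookup table; most other character sets are not that convenient. The names Latin1_Demo and To_UTF8 are ours, chosen for illustration.

```ada
pragma Ada_2012;
with Ada.Text_IO; use Ada.Text_IO;

procedure Latin1_Demo is

   --  Illustration only: decode each ISO-8859-1 byte (one byte per
   --  character), keep the same code point, and re-encode it in UTF-8.
   function To_UTF8 (Latin1 : String) return String is
      Result : String (1 .. 2 * Latin1'Length);  --  worst case: 2 bytes each
      Last   : Natural := 0;
      Code   : Natural;
   begin
      for C of Latin1 loop
         Code := Character'Pos (C);
         if Code < 16#80# then
            --  ASCII is unchanged
            Last := Last + 1;
            Result (Last) := C;
         else
            --  Two-byte sequence: 110xxxxx 10xxxxxx
            Result (Last + 1) := Character'Val (16#C0# + Code / 64);
            Result (Last + 2) := Character'Val (16#80# + Code mod 64);
            Last := Last + 2;
         end if;
      end loop;
      return Result (1 .. Last);
   end To_UTF8;

   EAcute : constant Character := Character'Val (16#E9#);
   Output : constant String := To_UTF8 ("caf" & EAcute);
begin
   --  Print the resulting bytes as decimal values
   for C of Output loop
      Put (Natural'Image (Character'Pos (C)));
   end loop;
   New_Line;
end Latin1_Demo;
```

The four input bytes become five output bytes: 99, 97, 102, then 195 and 169 (C3 A9) for the e-acute.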

On Unix systems, there exists a standard library, iconv, to do this conversion work on your behalf. The GNU project also provides such an open-source library for other systems.

We have recently added a binding to this library in the GNAT Components Collection (GNATCOLL.Iconv), making it even easier to use from Ada. For instance:

    with GNATCOLL.Iconv;   use GNATCOLL.Iconv;
    procedure Main is
       EAcute : constant Character := Character'Val (16#E9#);
       --  in ISO-8859-1

       Result : constant String := Iconv
          ("Some string " & EAcute,
           To_Code   => UTF8,
           From_Code => ISO_8859_1);
    begin
       null;
    end Main;

XML/Ada has also included such conversion tables for a while, but supports many fewer character sets. Check the Unicode.CCS.* packages.

As you can see above, we are reusing the string type, since, in Ada, a string is not associated with any specific character set or encoding scheme. In general, as we mentioned before, this is not an issue, since an application will use a single encoding internally (UTF-8 in most cases). Another approach is to use a Wide_String or Wide_Wide_String. The same comment as for UTF-16 and UTF-32 applies: these make character manipulation more convenient, but at the cost of wasted memory.
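The trade-off between the two representations can be seen with the standard Ada 2012 package Ada.Strings.UTF_Encoding.Wide_Wide_Strings, which converts between UTF-8 strings and Wide_Wide_String. This is a minimal sketch; the procedure name Wide_Demo is ours, and the size computation assumes that Wide_Wide_Character occupies 32 bits, as it does with GNAT.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

procedure Wide_Demo is
   package Conv renames Ada.Strings.UTF_Encoding.Wide_Wide_Strings;

   --  "cafe" with an e-acute, already encoded in UTF-8 (C3 A9)
   UTF8 : constant Ada.Strings.UTF_Encoding.UTF_8_String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);

   --  One array element per character, whatever the character
   Wide : constant Wide_Wide_String := Conv.Decode (UTF8);
begin
   Put_Line ("UTF-8 bytes    :" & Natural'Image (UTF8'Length));
   Put_Line ("Characters     :" & Natural'Image (Wide'Length));
   Put_Line ("Wide_Wide bytes:"
             & Natural'Image (Wide'Length * Wide_Wide_Character'Size / 8));
   Put_Line ("Round trip OK  : "
             & Boolean'Image (Conv.Encode (Wide) = UTF8));
end Wide_Demo;
```

Indexing Wide gives direct access to the Nth character, which the UTF-8 form cannot offer, at the price of more than three times the memory for this mostly ASCII text.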

Manipulating UTF-8 and UTF-16 strings

The last piece of the puzzle, once we have a Unicode string in memory, is to find each character in it. This requires specialized subprograms, since the number of bytes is variable for each character.

XML/Ada's Unicode module includes such a set of subprograms in its Unicode.CES.* packages. In general, going forward is relatively easy and can be done efficiently, whereas going backward in a string is more complex and less efficient.

The GNAT run-time library also contains such packages, for instance GNAT.Encode_UTF8_String and GNAT.Decode_UTF8_String. In particular, the latter provides Decode_Wide_Character, Next_Wide_Character, and Prev_Wide_Character, to find all the characters in a string.
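As a sketch of how these routines might be used, the following program walks through a UTF-8 string with Decode_Wide_Character, which decodes the character starting at the given position and moves the position past it. The procedure name Walk is ours, and the program assumes a GNAT compiler, since the package is GNAT specific.

```ada
with Ada.Text_IO;             use Ada.Text_IO;
with GNAT.Decode_UTF8_String; use GNAT.Decode_UTF8_String;

procedure Walk is
   --  "cafe" with an e-acute, encoded in UTF-8 (C3 A9)
   Text : constant String :=
     "caf" & Character'Val (16#C3#) & Character'Val (16#A9#);

   Ptr  : Natural := Text'First;
   Char : Wide_Character;
begin
   while Ptr <= Text'Last loop
      --  Decodes the character at Text (Ptr) and advances Ptr past it
      Decode_Wide_Character (Text, Ptr, Char);
      Put_Line ("Code point:" & Natural'Image (Wide_Character'Pos (Char)));
   end loop;
end Walk;
```

Next_Wide_Character and Prev_Wide_Character move the same kind of index forward or backward without decoding.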

Emmanuel Briot
AdaCore

Emmanuel Briot has been with AdaCore since 1998. He has been involved in a variety of projects, in particular oriented towards graphical user interfaces, including GtkAda, GPS, XML/Ada, GnatTracker and our internal CRM. He holds an engineering degree from the Ecole Nationale des Telecommunications (Brest, France).

Gem #143 : Return to the Sources

Let's get started...

It all starts from the source.

All large applications organize their source code into multiple separate directories, which we generally think of as modules. The source files themselves generally follow naming conventions so that we can easily find things. For instance, the traditional extensions for Ada files (in GNAT) are .adb and .ads, although other technologies use other extensions (.1.ada, for instance).

A lot of tools, in particular the compiler and the IDE, need to find source files in order to perform various actions on the code. Once they have found the sources, though, they also need to know how to manipulate them. For instance, the compiler might need to compile a specific file with style checks turned off, whereas all other files need style checks enabled, to ensure style consistency.

A typical application, nowadays, uses multiple languages, such as Ada and C. Each language has its own naming scheme, and perhaps its own set of tools.

At the beginning, GNAT was using switches like -I to point to the various source directories, and was expecting all source files to use .adb and .ads suffixes. But then we introduced project files, which serve as a convenient place to describe the organization of software projects. In contrast to a Makefile, they are purely descriptive and do not describe the set of actions to perform. This makes them ideally suited for sharing among multiple tools.

Other Gems have already talked about various aspects of projects, so we will not go into the details here.

However, it might happen that your own application could use what's in the project files. Parsing those efficiently is tricky, since we keep adding features to support various aspects of managing sources, and the parser would have to be kept up to date.

Instead, we recommend using the package GNATCOLL.Projects, found in the GNAT Components Collection, to manipulate project files. This Gem presents a brief introduction to the features of this package.

Here's a simple example of use:

   pragma Ada_05;
   with GNATCOLL.Projects;   use GNATCOLL.Projects;
   with GNATCOLL.VFS;        use GNATCOLL.VFS;  --  Gem 118
   ...
   declare
      Tree : Project_Tree;
   begin
      Tree.Load (GNATCOLL.VFS.Create ("root.gpr"));
   end;

There is often confusion among terms. Let's give a few definitions that are used within the GNATCOLL API. A project tree is a set of projects that might depend on each other. You can think of the tree as representing your whole application or source base. It is generally subdivided into modules, each of which contains a single project file.

In the example above, what we loaded is the tree rooted at root.gpr. Thus, we loaded the project root.gpr, but also perhaps the project child.gpr on which root.gpr depends.

Environment

In fact, the example above is often too simplistic. A project can be configured for multiple scenarios (by using the "external" keyword in the project file, and then some case statements, for instance to change the list of source directories depending on an environment variable).

Likewise, your project might depend on some preinstalled projects. For instance, if you intend to use GNATCOLL.Projects, your project will likely depend on gnatcoll.gpr. To find these projects, GNATCOLL will by default ask gnatls where it thinks the predefined projects are, or what the run-time directory is. But you can also add your own.

To do this, you need to go through an instance of Project_Environment, as in the following code:


   declare
      Env : Project_Environment_Access;
   begin
      Initialize (Env);
      Env.Set_Predefined_Source_Path ((1 => Create ("/usr/local/prefix")));

      --  add a custom language
      Env.Register_Default_Language_Extension ("python", ".py", "");

      --  set up scenario variables
      Env.Change_Environment ("VARIABLE", "VALUE");

      Tree.Load (Create ("root.gpr"), Env => Env);
   end;

This time, the project is loaded in a specific, preinitialized context, which might affect the view the application has of it.

If you need to change the scenario during the lifetime of your application, you would do the following:

    Env.Change_Environment ("VARIABLE", "VALUE2");
    Tree.Recompute_View;

which reloads the same project, in a different scenario. Now, for instance, the list of source files or compiler switches could be different.

Queries

Once we have the projects loaded in memory, we need to perform queries to extract information.

First of all, let's find the list of all source files in the application.

   pragma Ada_12;   --  convenient for iterators
   ...
   declare
      Src : File_Array_Access :=
              Tree.Root_Project.Source_Files (Recursive => True);
   begin
      for S of Src loop
         Put_Line (S.Display_Full_Name);
      end loop;
      Free (Src);
   end;

The source files are returned as instances of a Virtual_File. As we saw in Gem 118, such an object provides a convenient cache for information that otherwise would need to be queried via system calls, which can be slow on some systems. Also, it doesn't presume whether you are going to need full paths, base names, or other information. These objects are cached in the project tree, so that every time you request source files, the same instances (and their caches) are returned.

A frequent operation is that you have a base name for a source file (for instance a.adb), and want to find it on disk. This can easily be achieved with:

   declare
      A_Adb : constant Virtual_File := Tree.Create ("a.adb");
   begin
      Put_Line (A_Adb.Display_Full_Name);
   end;

Again, this information is cached, so is very fast to query.

Naming schemes

As much as possible, tools should be able to handle multiple source languages. From a source file, we therefore need to know its language, which can be done with:

   Put_Line (Tree.Info (A_Adb).Language);    --  "ada"

Ada, in particular, also has the notion of units. For instance, the unit GNATCOLL.Projects is in the source file "gnatcoll-projects.ads". The mapping from one to the other is fully described in the project file, and is not something that each application should assume, or recompute on its own.

From a source file, retrieving the name of the unit is done with:

   Put_Line (Tree.Info (A_Adb).Unit_Name);   --  "A"
   Put_Line (Tree.Info (A_Adb).Unit_Part);   --  Unit_Body

From the unit, finding the source file is done with:

   Put_Line (+Tree.Root_Project.File_From_Unit ("A", Unit_Body, "Ada"));  --  "a.adb"

Attributes

The information in a project file is organized into packages (typically one for each tool like the compiler, binder, IDE,...), and then into attributes. Users are free to add their own packages, so you could decide that your own tool's configuration should go into the package My_Tool, and which attributes can be used for that configuration. You should not however add new attributes to the predefined packages, since the project parser will complain otherwise, to avoid possible future name clashes.

A typical attribute is Switches, which specifies the command-line switches to pass to your tool. So the user's project file could contain:

  project Root is
     package My_Tool is
        for Switches ("Ada") use ("-a", "-b");
     end My_Tool;
  end Root;

From Ada, you can retrieve the value of this attribute with the following piece of code.

   pragma Ada_12;
   with GNAT.Strings;   use GNAT.Strings;
   ...
   declare
      My_Tool_Switches : constant Attribute_Pkg_List :=
         Build (Package_Name => "My_Tool", Attribute_Name => "Switches");

      Switches : GNAT.Strings.String_List_Access :=
         Tree.Root_Project.Attribute_Value (My_Tool_Switches, Index => "Ada");
   begin
      for S of Switches loop
         Put_Line (S.all);   --  "-a", then "-b"
      end loop;

      Free (Switches);
   end;

Some constants for the predefined attributes are already declared in GNATCOLL.Projects.

GNATCOLL.Projects also provides an API to edit project files. It has the same limitation as GPS does (not surprisingly): when you edit a project, the changes might impact the whole project file, so your handcrafted formatting or comments might disappear. Mostly, you should only edit projects that were created through the same API, at the risk of otherwise losing user changes.

GNATCOLL.Projects is able to parse all projects, even those it cannot edit later on, with the notable exception of aggregate projects. Support for those is on the roadmap, but hasn't been implemented yet.

Emmanuel Briot
AdaCore


Gem #142 : Exception-ally

The standard Ada run-time library provides the package Ada.Exceptions. This package provides a number of services to help analyze exceptions.

Each exception is associated with a (short) message that can be set by the code that raises the exception. In more recent versions of Ada, these messages can be set very simply, as in the following code:

    Ada.Exceptions.Raise_Exception         --  Ada 95
        (Constraint_Error'Identity, "some message");

    raise Constraint_Error with "some message";  --  Ada 2005

The new syntax is now very convenient, and developers should be encouraged to provide as much information as possible along with the exception. However, the length of the message is limited to 200 characters by default in GNAT, and messages longer than that will be truncated.
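As a small illustration, the message can carry the offending values, and a handler can read it back with Exception_Message (and the exception's name with Exception_Name). The names Message_Demo, Check, and Max_Items below are hypothetical, made up for this sketch.

```ada
with Ada.Exceptions; use Ada.Exceptions;
with Ada.Text_IO;    use Ada.Text_IO;

procedure Message_Demo is
   Max_Items : constant := 10;

   procedure Check (Count : Natural) is
   begin
      if Count > Max_Items then
         --  Put the values that caused the failure in the message
         raise Constraint_Error with
           "Count =" & Natural'Image (Count)
           & " exceeds the limit of" & Natural'Image (Max_Items);
      end if;
   end Check;

begin
   Check (42);

exception
   when E : Constraint_Error =>
      Put_Line ("Name   : " & Exception_Name (E));
      Put_Line ("Message: " & Exception_Message (E));
end Message_Demo;
```

Here the handler prints the name CONSTRAINT_ERROR followed by the message "Count = 42 exceeds the limit of 10", which is short enough not to be truncated.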

Exceptions also embed information set by the run-time itself that can be retrieved by calling the Exception_Information function. In the case of GNAT, this information might include the source location where the exception was raised and a nonsymbolic traceback. The function Exception_Information also displays the Exception_Message. Here is a typical exception handler that catches all unexpected exceptions in the application:

   pragma Ada_05;
   with Ada.Exceptions;
   with Ada.Text_IO;   use Ada.Text_IO;

   procedure Main is

       procedure Nested is
       begin
           raise Constraint_Error with "some message";
       end Nested;

   begin
       Nested;

   exception
       when E : others =>
           Put_Line (Ada.Exceptions.Exception_Information (E));
   end Main;

Let's now compile this application with no special command-line option.

  > gnatmake main.adb
  > ./main

Exception name: CONSTRAINT_ERROR
Message: some message

That's not very informative. To get more information, we need to rerun the program in the debugger. To make the session more interesting though, we should add debug information in the executable, which means using the -g switch in the gnatmake command.

The session would look like the following (omitting some of the output from the debugger):

  > rm *.o      # Cleanup previous compilation
  > gnatmake -g main.adb
  > gdb ./main
  (gdb)  catch exception
  (gdb)  run
Catchpoint 1, CONSTRAINT_ERROR at 0x0000000000402860 in main.nested () at main.adb:8
8               raise Constraint_Error with "some message";

  (gdb) bt
#0  <__gnat_debug_raise_exception> (e=0x62ec40 <constraint_error>) at s-excdeb.adb:43
#1  0x000000000040426f in ada.exceptions.complete_occurrence (x=x@entry=0x637050)
    at a-except.adb:934
#2  0x000000000040427b in ada.exceptions.complete_and_propagate_occurrence (
    x=x@entry=0x637050) at a-except.adb:943
#3  0x00000000004042d0 in <__gnat_raise_exception> (e=0x62ec40 <constraint_error>,
    message=...) at a-except.adb:982
#4  0x0000000000402860 in main.nested ()
#5  0x000000000040287c in main ()

And we now know exactly where the exception was raised.

But in fact, we could have this information directly when running the application.

For this, we need to bind the application with the switch -E, which tells the binder to store exception tracebacks in exception occurrences. Let's recompile and rerun the application.

  > rm *.o   # Cleanup previous compilation
  > gnatmake -g main.adb -bargs -E
  > ./main

Exception name: CONSTRAINT_ERROR
Message: some message
Call stack traceback locations:
0x10b7e24d1 0x10b7e24ee 0x10b7e2472

The traceback, as is, is not very useful. We now need to use another tool that is bundled with GNAT, called addr2line. Here is an example of its use:

  > addr2line -e main --functions --demangle 0x10b7e24d1 0x10b7e24ee 0x10b7e2472
/path/main.adb:8
_ada_main
/path/main.adb:12
main
/path/b~main.adb:240

This time we do have a symbolic backtrace, which shows information similar to what we got in the debugger.

On OS X machines, addr2line is not available, but an equivalent solution exists. You need to link your application with an additional switch, and then use the tool atos, as in:

   > rm *.o
   > gnatmake -g main.adb -bargs -E -largs -Wl,-no_pie
   > ./main

Exception name: CONSTRAINT_ERROR
Message: some message
Call stack traceback locations:
0x1000014d1 0x1000014ee 0x100001472
   > atos -o main 0x1000014d1 0x1000014ee 0x100001472
main__nested.2550 (in main) (main.adb:8)
_ada_main (in main) (main.adb:12)
main (in main) + 90

We will now discuss a relatively new switch of the compiler, namely -gnateE. When used, this switch will generate extra information in exception messages.

Let's amend our test program to:

pragma Ada_05;
with Ada.Exceptions;
with Ada.Text_IO;      use Ada.Text_IO;

procedure Main is

    procedure Nested (Index : Integer) is
       type T_Array is array (1 .. 2) of Integer;
       T : constant T_Array := (10, 20);
    begin
       Put_Line (T (Index)'Img);
    end Nested;

begin
    Nested (3);

exception
    when E : others =>
        Put_Line (Ada.Exceptions.Exception_Information (E));
end Main;

We compile it with the additional switch and then run it:

  > gnatmake -g main.adb -gnateE -bargs -E -g -largs
  > ./main

Exception name: CONSTRAINT_ERROR
Message: main.adb:10:18 index check failed
index 3 not in 1..2
Call stack traceback locations:
0x100001429 0x1000014c7 0x1000013c2

The exception information (traceback) is the same as before, but this time the exception message is set automatically by the compiler. So we know we got a Constraint_Error because an incorrect index was used at the named source location (main.adb line 10). But the significant addition is the second line of the message, which indicates exactly the cause of the error. Here, we wanted to get the element at index 3, in an array whose range of valid indexes is from 1 to 2.

No need for a debugger in this case.

The column information on the first line of the exception message is also very useful when dealing with null pointers. For instance, a line such as:

   A := Rec1.Rec2.Rec3.Rec4.all;

where each of the Rec is itself a pointer, might raise Constraint_Error with a message "access check failed". This indicates for sure that one of the pointers is null, and by using the column information it is generally easy to find out which one it is.

Emmanuel Briot
AdaCore


Gem #141 : Con-figure it out

Let's get started...

In the Gem series on GNAT.Command_Line (Gems #138 and #139), we mentioned that there are several ways a user can control the behavior of an application. These are command-line options (as discussed in those Gems), graphical user interfaces (for instance using GtkAda), and configuration files. This Gem proposes various approaches for the latter.

Configuration files are user- or machine-editable files. This is a general term, though, which does not imply any specific format. Several formats are frequently used, and we discuss a few such formats in this Gem.

XML files

XML is very frequently used to store machine-editable data, but it is not exactly a human-friendly format and is extremely verbose. However, among its various pros are a rich ecosystem of tools to view and edit XML files, as well as standard APIs to parse and validate such files.

XML files can contain any type of data (including binary data, although they are not the best approach in that case), and they can be checked against grammars (also known as XML schemas).

Such files can be parsed and validated through the XML/Ada library, as discussed in Gem #21.

JSON files

JSON (JavaScript Object Notation) is a more recent format that has started to gain a strong foothold in modern applications, especially those that are web-oriented. This format is much lighter, and similar to JavaScript. Here is a quick example of such a file:

{"configuration": {
    "key1": "value1",
    "key2": 12,
    "key3": [1, 2, 3]
  }
}

This format is easily readable and writable. One of its drawbacks, though, is that it has no syntax for comments, so the documentation needs to be put in a separate file. However, there are lots of libraries (in several programming languages) that can be used to parse these files. Most notably, all modern web browsers are fully capable of receiving and then parsing these files, so they are a convenient way to exchange data (including configurations, for instance) between a web server and a web client.

In Ada, you could use the package GNATCOLL.JSON, part of the GNAT Components Collection. A future Gem will describe this package in more detail, although for now you can look at the specs.

INI files

These files were the standard for Windows applications and used the .ini extension, hence their name. This file format encompasses a vague syntax, the details of which vary from one application to another. Here is an example of such a file:

[General]
# Some documentation
key1=value1
key2=12
key3=1
key3=2

[Section2]
filename=$HOME/value2

This format contains key-value pairs, optionally organized into sections. Comments can be inserted as shown in the example above.
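To show how little structure the format has, here is a hand-rolled sketch that classifies each line as a comment, a section header, or a key-value pair. It is not how GNATCOLL.Config is implemented, and the names INI_Sketch and Handle are ours; it also ignores details such as spaces around the equal sign.

```ada
with Ada.Text_IO;           use Ada.Text_IO;
with Ada.Strings;           use Ada.Strings;
with Ada.Strings.Fixed;     use Ada.Strings.Fixed;
with Ada.Strings.Unbounded; use Ada.Strings.Unbounded;

procedure INI_Sketch is
   Section : Unbounded_String;  --  name of the section being read

   --  Process one line of an INI file
   procedure Handle (Raw : String) is
      Line : constant String  := Trim (Raw, Both);
      Eq   : constant Natural := Index (Line, "=");
   begin
      if Line'Length = 0 or else Line (Line'First) = '#' then
         null;  --  blank line or comment
      elsif Line (Line'First) = '[' and then Line (Line'Last) = ']' then
         --  Section header: remember its name
         Section :=
           To_Unbounded_String (Line (Line'First + 1 .. Line'Last - 1));
      elsif Eq > Line'First then
         --  Key-value pair with a nonempty key
         Put_Line (To_String (Section) & "."
                   & Line (Line'First .. Eq - 1)
                   & " = " & Line (Eq + 1 .. Line'Last));
      end if;
   end Handle;

begin
   Handle ("[General]");
   Handle ("# Some documentation");
   Handle ("key1=value1");
   Handle ("key3=1");
   Handle ("key3=2");
   Handle ("");
   Handle ("[Section2]");
   Handle ("filename=$HOME/value2");
end INI_Sketch;
```

Note that key3 appears twice in the same section; a real parser has to decide whether this means a list of values or an override, which is one of the details that vary between applications.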

Such files can be parsed by the GNATCOLL.Config package, also part of the GNAT Components Collection.

The basic usage for this package is as follows.

   pragma Ada_05;
   with GNATCOLL.Config;   use GNATCOLL.Config;

   declare
      C : INI_Parser;
   begin
      Open (C, "settings.txt");
      while not C.At_End loop
         Put_Line (C.Key & " = " & C.Value & " in " & C.Section);
         C.Next;
      end loop;
   end;

The loop iterates over all key-value pairs in the file, and for each returns the value as a string, an integer, or a file. Files can be specified, in the config file, as absolute or relative to the location of the file, and GNATCOLL.Config will automatically normalize those. As a special case, the string "$HOME" is automatically substituted with the location of the user's home directory.

Since there is no grammar associated with the file, it's the application's responsibility to query the value in a valid format. In fact, in the example above, the value for "filename" can be queried as a string or a file, depending on the context, although of course it cannot be queried as an integer.

The example above uses the "object-dotted" notation (prefix form of subprogram calls) introduced by Ada 2005, but the library itself can be used with applications not programmed in Ada 2005.

While this API works as expected, users have to store the values somewhere if they intend to use them later on. GNATCOLL.Config also provides a pool to store these for the duration of the application. This is especially convenient for preferences. In fact, this pool can be used for other configuration file formats as described above.

It is possible to parse multiple configuration files. Typically, an application would parse a system-wide settings file that contains defaults. These defaults can then be overridden by a user-specific settings file.

The pool will automatically remember where a value was read from, so that it can normalize file names relative to the given settings file.

The code would then look like:

   declare
      Pool   : Config_Pool;
      Parser : INI_Parser;
   begin
      Open (Parser, "/system/settings.txt");
      Fill (Pool, Parser);
      Open (Parser, "/user/settings.txt");
      Fill (Pool, Parser);  --  override with user-specific settings

      Put_Line ("key1 is " & Pool.Get ("key1", Section => "General"));
      Put_Line ("key3 is "
                & Pool.Get ("key3", Section => "General", Index => 1)
                & ", "
                & Pool.Get ("key3", Section => "General", Index => 2));
      Put_Line ("filename is "
                & Pool.Get_File ("filename", "General").Display_Full_Name);
   end;

However, there still remains one issue with this code. If the organization of the settings file changes (for instance, because you rename a key, or move it to a different section), you will need to find all locations in your code that referenced it, and change the name and/or section.

So GNATCOLL.Config provides one more approach which is to define the keys as global variables.

   declare
      Filename : constant Config_Key := Create ("filename", "General");
      Key1     : constant Config_Key := Create ("key1", "General");
      Key3     : constant Config_Key := Create ("key3", "General");
   begin
      Put_Line ("key1 is " & Key1.Get (Pool));
      Put_Line ("key3 is "
                & Key3.Get (Pool, 1) & ',' & Key3.Get (Pool, 2));
      Put_Line ("filename is " & Filename.Get_File (Pool).Display_Full_Name);
   end;

Now, only a single location in the code needs to be changed when the format of the configuration file changes.

By design, GNATCOLL.Config can be extended by declaring new types of parser derived from Config_Parser. These parsers could handle XML or JSON formats, and they would still be compatible with the Config_Pool described above, thus making it easy to access the configuration files throughout your application.

Emmanuel Briot
AdaCore


Gem #140: Bridging the Endianness Gap

Let's get started!

This Gem presents a new implementation-defined attribute introduced in the GNAT Ada tool suite that aims to facilitate the peaceful coexistence of big-endian and little-endian data applications. This is particularly important, for example, when retargeting an application from legacy hardware using a given endianness to another platform of different endianness. If any data was stored in memory or persistent storage by the legacy system, or if interoperability with other subsystems is to be preserved, it is necessary for all data structures to have precisely the same representation on the two platforms.

Consider the following two-byte data structure and record representation clause:

   subtype Yr_Type is Natural range 0 .. 127;
   subtype Mo_Type is Natural range 1 .. 12;
   subtype Da_Type is Natural range 1 .. 31;

   type Date is record
      Years_Since_1980 : Yr_Type;
      Month            : Mo_Type;
      Day_Of_Month     : Da_Type;
   end record;

   for Date use record
      Years_Since_1980 at 0 range 0  ..  6;
      Month            at 0 range 7  .. 10;
      Day_Of_Month     at 0 range 11 .. 15;
   end record;

The representation of "December 12th, 2012" on a big-endian system is as follows:

 

0100000 1   100 01100
yyyyyyy m   mmm ddddd
    65         140

 

On a little-endian system the same date is represented as:

 

0 0100000   01100 110
m yyyyyyy   ddddd mmm
    32         102

 

One may believe that standard attribute Bit_Order will resolve this situation. Let's give it a try, and amend the declaration above with:

for Date'Bit_Order use System.High_Order_First;

On a big-endian system this is the default, and we can verify that the representation is unchanged: the clause just restates the default explicitly (it is said to be confirming), and has no further effect.

On a little-endian system, however, the result isn't the expected one. To understand why, one needs to look at how the Ada standard defines the interpretation of record representation clauses.

When the bit order is the default bit order, growing bit offsets simply correspond to going into successive (growing) addresses in memory. But when the bit order is the opposite value, bits are numbered "backwards" with respect to the machine's way of storing data as successive bytes, so additional rules are required to know what bits we need to look at.

It is important to keep in mind that bit offsets for a component in a record representation clause are always relative to some integer value (called "machine scalars" in the Ada RM), from which the component value is extracted using a shift and a mask operation.

To find out which machine scalar a given component belongs to, you must first identify the set of components that share the same byte offset. In our example, this would be all three components, since all are specified with a byte offset of 0. Next, consider the highest bit offset for all of these components. Here, it's 15. The size of the machine scalar is then the next larger power of two: 16 in this example. So, this means that all three components in our record are part of a single machine scalar which is a two-byte integer of size 16 bits. If you are on a little-endian machine, the low-order byte of this machine scalar is always the one stored at the lower address, and the high-order byte is the one stored at the higher address. And it is essential to note that this is independent of the bit order specified for the data structure.

So, if you are on a little-endian system, and you specify a High_Order_First bit order, the two-byte machine scalar value will be:

 

0100000110001100
yyyyyyymmmmddddd

 

which when stored as two successive bytes in memory will correspond to:

 

10001100 01000001
   140      65

 

It is interesting to note that this differs from both the native little-endian representation and the native big-endian representation. That's because the order in which the bytes that constitute machine scalars are written to memory is not changed by the Bit_Order attribute -- only the indices of bits within machine scalars are changed.
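The three layouts discussed so far can be recomputed by hand with a few shifts. The sketch below does not use any representation clause: it simply builds the 16-bit machine scalar under each bit-numbering convention and prints its two bytes in the order in which each kind of machine would store them. All names (Layout_Demo, HOF_Scalar, LOF_Scalar, Show) are ours, chosen for illustration.

```ada
with Ada.Text_IO; use Ada.Text_IO;
with Interfaces;  use Interfaces;

procedure Layout_Demo is
   --  December 12th, 2012
   Year  : constant Unsigned_16 := 2012 - 1980;
   Month : constant Unsigned_16 := 12;
   Day   : constant Unsigned_16 := 12;

   --  High_Order_First: bit 0 is the most significant bit, so the
   --  component at bits 0 .. 6 sits at the top of the scalar.
   HOF_Scalar : constant Unsigned_16 :=
     Shift_Left (Year, 9) or Shift_Left (Month, 5) or Day;

   --  Low_Order_First: bit 0 is the least significant bit
   LOF_Scalar : constant Unsigned_16 :=
     Year or Shift_Left (Month, 7) or Shift_Left (Day, 11);

   --  Print the two bytes of Scalar in increasing address order
   procedure Show
     (Label : String; Scalar : Unsigned_16; Big_Endian : Boolean)
   is
      High : constant Unsigned_16 := Shift_Right (Scalar, 8);
      Low  : constant Unsigned_16 := Scalar and 16#FF#;
   begin
      if Big_Endian then
         Put_Line
           (Label & Unsigned_16'Image (High) & Unsigned_16'Image (Low));
      else
         Put_Line
           (Label & Unsigned_16'Image (Low) & Unsigned_16'Image (High));
      end if;
   end Show;

begin
   Show ("Native big-endian             :", HOF_Scalar, True);
   Show ("Native little-endian          :", LOF_Scalar, False);
   Show ("Little-endian, High_Order_First:", HOF_Scalar, False);
end Layout_Demo;
```

The three lines show 65 140, 32 102, and 140 65 respectively. The last case uses the big-endian scalar value but the little-endian byte order, which is exactly why it matches neither native layout.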

Now enter Scalar_Storage_Order

It is precisely to overcome this limitation of the language that a new attribute, Scalar_Storage_Order, was introduced in GNAT. Its effect is to override the order of bytes in machine scalars for a given record type. So let's add another attribute definition:

for Date'Scalar_Storage_Order use System.High_Order_First;

This means that the bytes constituting the machine scalars must be swapped when stored in memory. We see that the memory representation then becomes (65, 140): it is now consistent with the native representation on a big-endian platform.

Existing code for a big-endian system can thus be ported to a little-endian system without any fuss, and without any change of data representation, just by adding appropriate attribute definitions on relevant record type declarations.

Compatibility with legacy toolchains

When retargeting an application with a change of endianness, it is convenient to use attribute Scalar_Storage_Order so that the new platform uses the same data representation as the old one. However, you might still want to be able to compile your application for your old target, with a legacy toolchain that might not support the newer attribute. In this case you can specify it using the alternate syntax:

pragma Attribute_Definition
   (Scalar_Storage_Order, Date, System.High_Order_First);

Older toolchains, which know nothing about the new attribute (or about the new implementation-defined pragma Attribute_Definition), will simply ignore it, whereas newer compilers will treat this pragma as exactly equivalent to the corresponding attribute definition clause. The same application code can thus be compiled on both the legacy target with a legacy toolchain, and on the newer target (with different endianness) with a recent compiler, always using the same consistent data representation.

Demo program

Note: two versions of the demo program are provided below. Both produce the same result with current GNAT Pro releases. However, GNAT GPL 2012 was branched early during development of this feature and has an issue with the syntax used in the first version of the demo program. For that compiler release, you therefore need to use the endianness_demo_gpl2012.adb version, which uses an alternative syntax and produces the expected result.

When run on a little-endian machine, the attached demo program outputs:

Default bit order: LOW_ORDER_FIRST
N      : 32 102
LE_Bits: 32 102
BE_Bits: 140 65
LE:      32 102
BE:      65 140

 

On a big-endian machine, it shows:

Default bit order: HIGH_ORDER_FIRST
N      : 65 140
LE_Bits: 102 32
BE_Bits: 65 140
LE:      32 102
BE:      65 140

 

Thomas Quinot
AdaCore

Thomas Quinot holds an engineering degree from Télécom Paris and a PhD from Université Paris VI. The main contribution of his research work is the definition of a flexible middleware architecture aiming at interoperability across distribution models. He joined AdaCore as a Senior Software Engineer in 2003, and is responsible for the distribution technologies. He also participates in the development, maintenance and support of the GNAT compiler.
