Hackfest, 29 September 2024, 12:00
Frans Faase
https://www.iwriteiam.nl/Hackfest_2024.html
Linux is an open-source operating system build from available sources. For closed source operating systems, like Windows and MacOS, there is no way to verify that it is reliable.
In February 2024, a malicious backdoor was introduced to the Linux utility xz within the liblzma library in versions 5.6.0 and 5.6.1 by an account using the name "Jia Tan". The backdoor gives an attacker who possesses a specific Ed448 private key remote code execution capabilities on the affected Linux system.
The backdoor was not introduced in the code itself, but added to the code during the execution of a modified version of the build script that was only distributed with the tar files.
The injection of these kind of backdoors can be avoided by proper reviews and/or automatic checks.
The backdoor did get into some running release distributions. Some people prefer running releases because they are often the first to have security updates. But now it appears that they are also the first to have malicious backdoors.
Although the sources of Linux are all known, building the Linux kernel a compiler is needed to compile the C-source files into executable of the target platform.
A compiler is a program which translates a program from an input language to an output language. The input language is often a high-level, human readable language, while the output language is a low-level language, including machine language, which can be executed by the CPU. A compiler, like any other program, is programmed in a language, usually in machine language, or otherwise, depends on some other program based on machine language.
Machine language is usually in a binary form that is not easily readable by humans. This poses a problem for a non-trivial program, such as an optimizing compiler. Although reverse engineering executables is possible, it is a lot of work.
A malicious compiler could insert a back-door in some executables or libraries of the Linux kernel. Such a compiler could recognize when it is compiled by itself and duplicate the malicious code in the resulting compiler, making it malicious as well.
Ken Thompson presented the idea of a malicious compiler during the Turing Award Lecture Reflections on Trusting Trust in 1984.
The live-bootstrap approach is to build the necessary tools for compiling the Tiny C Compiler in a number of steps from a small binary file. Next the Tiny C Compiler, again with many steps, is used to compile GNU C Compiler 4.0.4, 4.7.4 (the last one written in C), 10,4.0, and finally 13.1.0. This last compiler can be used to compiler the Linux kerner sources.
For executables, Linux makes use of Executable and Linkable Format (ELF) files. These files start with a header, followed by the program header table, an number of code and/or data segments, and (optionally) a number of section header tables, which include symbol tables with information that can be used by a debugger.
The small binary is a program called hex0, which converts a file with hexadecimal numbers into a binary file. This program is specific for the target CPU for which Linux needs to be build. These programs can be found in the bootstrap-seeds repository. In the README.md file it states: 'NEVER TRUST ANYTHING IN HERE'. For x86 (32 bits) the hexadecimal representation of this program is given in the file hex0_x86.hex0, meaning that if this file is compiled with hex0 it will return hex0, assuming that it is not a malicious variant of hex0. The hex0 program requires two arguments, the names of the input and output files. The function of the program can also be achieved with following command line:
sed 's/[;#].*$//g' $input_file | xxd -r -p > $output_fileThis makes use of the stream editing program sed and the xxd (hex dump) program that can be used to convert, in both directions, between binary files and hexadecimal files.
hex0_x86.hex0
# SPDX-FileCopyrightText: 2019 Jeremiah Orians # SPDX-FileCopyrightText: 2022 Andrius Štikonas # # SPDX-License-Identifier: GPL-3.0-or-later ## ELF Header #:ELF_base 7F 45 4C 46 # e_ident[EI_MAG0-3] ELF's magic number 01 # e_ident[EI_CLASS] Indicating 32 bit 01 # e_ident[EI_DATA] Indicating little endianness 01 # e_ident[EI_VERSION] Indicating original elf 03 # e_ident[EI_OSABI] Set at 3 because FreeBSD is strict 00 # e_ident[EI_ABIVERSION] Set at 0 because no one cares 00 00 00 00 00 00 00 # e_ident[EI_PAD] 02 00 # e_type Indicating Executable 03 00 # e_machine Indicating x86 01 00 00 00 # e_version Indicating original elf 54 80 04 08 # e_entry Address of the entry point 34 00 00 00 # e_phoff Address of program header table 00 00 00 00 # e_shoff Address of section header table 00 00 00 00 # e_flags 34 00 # e_ehsize Indicating our 52 Byte header 20 00 # e_phentsize size of a program header table 01 00 # e_phnum number of entries in program table 00 00 # e_shentsize size of a section header table 00 00 # e_shnum number of entries in section table 00 00 # e_shstrndx index of the section names ## Program Header #:ELF_program_headers #:ELF_program_header__text 01 00 00 00 # ph_type: PT-LOAD = 1 00 00 00 00 # ph_offset 00 80 04 08 # ph_vaddr 00 80 04 08 # ph_physaddr 00 01 00 00 # ph_filesz 00 01 00 00 # ph_memsz 07 00 00 00 # ph_flags: PF-X|PF-W|PF-R = 7 01 00 00 00 # ph_align #:ELF_text # Where the ELF Header is going to hit # Simply jump to _start # Our main function # :_start ; (0x8048054) 58 ; pop_eax # Get the number of arguments 5B ; pop_ebx # Get the program name 5B ; pop_ebx # Get the actual input name 31C9 ; xor_ecx,ecx # prepare read_only, ecx = 0 31D2 ; xor_edx,edx # Extra sure, edx = 0 6A 05 ; push !5 # prepare to set eax to 5 58 ; pop_eax # the syscall number for open() CD 80 ; int !0x80 # Now open that damn file 89C6 ; mov_esi,eax # Preserve the file pointer we were given 5B ; pop_ebx # Get the actual output name 66B9 4102 ; mov_cx, @577 # Prepare file as O_WRONLY|O_CREAT|O_TRUNC 66BA C001 ; mov_dx, @448 # Prepare file as RWX for owner only (700 in octal) 6A 05 ; push !5 # prepare to set eax to 5 58 ; pop_eax # the syscall number for open() CD 80 ; int !0x80 # Now open that damn file 89C2 ; mov_edx,eax # Preserve the file pointer we were given # Our flag for byte processing 6A FF ; push !-1 5D ; pop_ebp # ebp = -1 # temp storage for the sum 31FF ; xor_edi,edi # edi = 0 #:loop ; (0x8048077) # Read a byte E8 68000000 ; call %Read_byte # process byte E8 1B000000 ; call %hex # Deal with -1 values 85C0 ; test_eax,eax 7C F2 ; jl !loop # deal with toggle 85ED ; test_ebp,ebp # jump if ebp >= 0 7D 06 ; jge !print # process first byte of pair 89C7 ; mov_edi,eax 31ED ; xor_ebp,ebp # ebp = 0 EB E8 ; jmp !loop # process second byte of pair #:print ; (0x804808F) # update the sum and store in output C1E7 04 ; shl_edi, !4 01F8 ; add_eax,edi # flip the toggle 4D ; dec_ebp # ebp = -1 E8 39000000 ; call %write_byte EB DB ; jmp !loop #:hex ; (0x804809C) # Purge Comment Lines (#) 3C 23 ; cmp_al, !35 74 1E ; je !purge_comment # Purge Comment Lines (;) 3C 3B ; cmp_al, !59 74 1A ; je !purge_comment # deal all ascii less than 0 3C 30 ; cmp_al, !48 7C 1F ; jl !ascii_other # deal with 0-9 3C 3A ; cmp_al, !58 7C 1F ; jl !ascii_num # deal with all ascii less than A 3C 41 ; cmp_al, !65 7C 17 ; jl !ascii_other # deal with A-F 3C 47 ; cmp_al, !71 7C 1C ; jl !ascii_high # deal with all ascii less than a 3C 61 ; cmp_al, !97 7C 0F ; jl !ascii_other # deal with a-f 3C 67 ; cmp_al, !103 7C 12 ; jl !ascii_low # The rest that remains needs to be ignored EB 09 ; jmp !ascii_other #:purge_comment ; (0x80480BE) # Read a byte E8 21000000 ; call %Read_byte # Loop if not LF 3C 0A ; cmp_al, !10 75 F7 ; jne !purge_comment # Otherwise return -1 #:ascii_other ; (0x80480C7) 6A FF ; push !-1 58 ; pop_eax # return -1 C3 ; ret #:ascii_num ; (0x80480CB) 2C 30 ; sub_al, !48 C3 ; ret #:ascii_low ; (0x80480CE) 2C 20 ; sub_al, !32 # convert to uppercase #:ascii_high ; (0x80480D0) 2C 37 ; sub_al, !55 C3 ; ret # Writes byte stored in al #:write_byte ; (0x80480D3) # Print our Hex 89D3 ; mov_ebx,edx # Where are we writing to 52 ; push_edx # protect fout 6A 01 ; push !1 # prepare to set edx to 1 5A ; pop_edx # set the size of chars we want 50 ; push_eax # Move output to stack 89E1 ; mov_ecx,esp # What we are writing 6A 04 ; push !4 # prepare to set eax to 4 58 ; pop_eax # the syscall number for write CD 80 ; int !0x80 # call the Kernel 5B ; pop_ebx # deallocate stack 5A ; pop_edx # restore fout C3 ; ret #:Read_byte ; (0x80480E4) # Attempt to read 1 byte from Input file 52 ; push_edx # protect fout 6A 01 ; push !1 # prepare to set edx to 1 5A ; pop_edx # set the size of chars we want 57 ; push_edi # allocate stack 89E1 ; mov_ecx,esp # Where to put it 89F3 ; mov_ebx,esi # Where are we reading from 6A 03 ; push !3 # prepare to set eax to 3 58 ; pop_eax # the syscall number for read CD 80 ; int !0x80 # call the Kernel 85C0 ; test_eax,eax # check what we got 74 03 ; je !Done # Got EOF call it done # load byte 58 ; pop_eax # load char 5A ; pop_edx # restore fout C3 ; ret #:Done ; (0x80480F9) # program completed Successfully 31DB ; xor_ebx,ebx # All is well, ebx = 0 6A 01 ; push !1 58 ; pop_eax # put the exit syscall number in eax CD 80 ; int !0x80 # Call it a good day #:ELF_end
Brainfuck is an esoteric program language with only eight commands, represented by the letters '+-<>[].,'. I wanted to see if it was possible to write a Brainfuck program to replace hex0. I wrote some JavaScript to generate a program from a simple programming language: BF generator. I verified that with some Brainfuck interpreter running it on hex0_x86.hex0 did produce a file equal to hex0.
Blog posts:
I decided to look some deeper into live-bootstrap to figure out which programs are executed and which files are being read. There is a global description of all the steps that is found in parts.rst. live-bootstrap comes with a script to execute all steps and a minimal shell, called the kaem shell, to execute the script. I wrote a program, kaem_parser.cpp, to parse the kaem files, which generates the HTML page live-bootstrap with all the information.
Blog posts:
To even dive deeper, I decided to implement an x86 emulator that implements only the x86 instructions that are actually used. The emulator also implments systems calls and multiple processes. I decided to simply map file operations on file operations on the file system, instead of simulating those in memory.
This allowed me also to verify the code of hex0 and to see if no other binary files were used.
The development of the emulator required some very complicated debuging, but it was also insightfull.
The emulator turned out too be slow to compile GNU Mes.
Blog posts:
With strace command it is possible to trace system calls that are performed during the execution of a command. I wrote a shell script to execute all the steps with an alternative root (using the chroot command) and trace some relevant system calls. I also wrote a program, called scan_trace.cpp to process the produced log file and generate an HTML page listing the processes and source files. It also generates a JSON file with similar information, which is used to generate a T-diagram
GNU Mes consist of Scheme (a dialect of the LISP family) interpreter programmed in a subset of C and a C compiler written in Scheme. It is a rather complete compiler including a separate linker. I get the idea that it is on par with the Tiny C compiler, which does not need a separate linker, because is not needed for compiling the Tiny C compiler. The MES compiler is rather slow, which is understandable because it is an interpreting runing an interpreting compiler. It is also slow because it compiles many individual C files into object files, which are linked at the end. Everytime all the Scheme files are loaded before the compiler can start its actual work.
The GNU Mes compiler requires a compiler for a minimal subset of C. Because of this the bootstrap does contain include a subset of C compiler and a a partial implementation of the C preprocessor. These are written in another subset of C for which there is compiler written in machine language, one for each supported CPU. I have the impression that the GNU Mes compiler is a bit of overkill for the gap it needs to fill.
On several forums, I have read people raising the question why Forth was not used in the implementation instead of LISP.
I made some investigation and attempts to bridge the gap. To see if it would be possible to implement a compiler with M2-Mesoplanet. I encountered some bugs in M2-Mesoplanet.
Blog posts:
Due to the bugs found in M2-Mesoplanet, I looked into implementing a stack based, Forth like, language that would be close to C and simple to compile. A language that could also serve as an intermediate language. For an example program see test.sl, which is an attempt to implement tcc_cc.c. The file stack_c.cpp contains the compiler for the stack based language.
Blog posts: