Compiler

Table of Contents

1. What is a compiler?

In the simplest terms, a compiler is a translator of programming languages. It generally translates a user-friendly (high-level) language into something that can be executed by the machine.

2. How does a compiler work?

  • A program in some language (Ex.: C) is given to the compiler
  • The compiler performs front-end analysis
  • This is turned into intermediate code
  • The intermediate code then goes through back-end synthesis
  • This then becomes executable code
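
As a picture (an informal sketch of the stages above):

source code (Ex.: C) -> [front-end analysis] -> intermediate code -> [back-end synthesis] -> executable code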

2.1. Front end analysis

Checks whether the program is legal. If not, it throws an error.

Generates the intermediate representation (also called intermediate code).

2.2. Intermediate code (IR/IC)

Multiple front ends can use the same back end: each source language gets its own front end, but they all emit the same IR, so one back end serves them all.

Using IR makes it easier to port the compiler to different CPUs/architectures.

2.3. Back end synthesis

Translates the intermediate representation into target code (Ex.: IR -> Assembly).
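
For example, one C statement passing through the stages (an illustrative sketch, not the output of any particular compiler):

a = b + c * 4;        /* source (C) */

t1 = c * 4            ; three-address IR
t2 = b + t1
a  = t2

imul eax, [c], 4      ; x86-style assembly (illustrative)
add  eax, [b]
mov  [a], eax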

3. The parts of a compiler

The compiler works top to bottom:

3.1. Lexical analyzer (Scanner)

Reads in the input characters and groups them into “lexemes.” A lexeme is a sequence of characters that matches the pattern for a token. For example, in C, ’<=’ would be a lexeme corresponding to the token ’LESSTHANOREQUAL’. A lexer translates lexemes into tokens.

A token is the meaningful unit that later phases operate on: typically a token name (like ’LESSTHANOREQUAL’) plus an optional attribute (like the lexeme itself).

Skips whitespace & comments

Keeps track of lines/columns to properly report errors.

The scanner gives the parser one token at a time.

A lexer does not detect most errors; the exception is illegal characters.

3.2. Syntax analyzer (Parser)

Creates a syntax tree

Parser gets token from scanner via getNextToken()
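
A hypothetical sketch of this interface in C (the type and function names here are illustrative, not from any particular compiler):

#include <stdio.h>

typedef enum { TOK_ID, TOK_NUM, TOK_LESSEQ, TOK_EOF } TokenType;

typedef struct {
    TokenType type;      /* which kind of token this is */
    const char *lexeme;  /* the matched text, e.g. "<=" */
    int line, col;       /* position, for error reporting */
} Token;

/* Stub scanner: yields a fixed token stream for demonstration. */
Token getNextToken(void) {
    static const Token stream[] = {
        { TOK_ID,     "x",  1, 1 },
        { TOK_LESSEQ, "<=", 1, 3 },
        { TOK_NUM,    "10", 1, 6 },
        { TOK_EOF,    "",   1, 8 },
    };
    static int i = 0;
    return stream[i].type == TOK_EOF ? stream[i] : stream[i++];
}

int main(void) {
    /* The parser's main loop: pull tokens until EOF. */
    for (Token t = getNextToken(); t.type != TOK_EOF; t = getNextToken())
        printf("token %d, lexeme '%s' at %d:%d\n", t.type, t.lexeme, t.line, t.col);
    return 0;
}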

3.3. Semantic analyzer

Checks that operations make sense (e.g., type checking) and catches anything illegal.

3.4. IR generator

Generates IR

3.5. Code optimizer

Optimizes the IR (e.g., folding constants, removing dead code).

3.6. Target code generator

Generates the code which can be executed by the specific machine

3.7. Symbol table

Compilers often use symbol tables. When a compiler encounters an identifier it doesn’t recognize, it inserts it into the symbol table. Next time it finds this identifier, it can simply reference the symbol table.

4. (F)lex

We can create a lexer with (F)lex. Flex takes in a set of regexps + actions and outputs C code implementing a scanner.

A flex input file has the structure

definitions
%%
rules
%%
user code

But what does all this mean?

4.1. Definitions

This is where we define patterns, aka regexps, and #includes. For example:

%{
#include <stdio.h>
#include <stdlib.h>
%}
DIGIT        [0-9]
CommentStart "//"
ID           [a-zA-Z][a-zA-Z0-9]*

4.2. Rules

Rules take the form pattern action.

4.2.1. Pattern

a regexp to be matched

If more than one pattern matches, the longest match has priority (given patterns ’<’ and ’<=’, the input ’<=’ matches the latter). If two patterns match the same length, the rule listed first in the input file is chosen.

4.2.2. Action

an action to be performed when regexp is matched

If there is no match, the default action is to copy the next character to stdout.

4.3. User code

The actual code of the program.

4.4. yylex()

Flex generates a scanner implemented as a function:

int yylex();
  • This returns which token was matched (an int).
  • The lexeme is stored in yytext (a char * pointing at the matched text)
  • The lexeme length is stored in yyleng
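
Putting sections 4.1–4.4 together, a minimal complete flex file might look like this (a sketch; the token codes are made up for illustration):

%option noyywrap
%{
#include <stdio.h>
/* hypothetical token codes, for illustration only */
enum { NUMBER = 258, IDENT = 259 };
%}
DIGIT [0-9]
ID    [a-zA-Z][a-zA-Z0-9]*
%%
{DIGIT}+  { return NUMBER; }
{ID}      { return IDENT; }
[ \t\n]+  { /* skip whitespace */ }
.         { printf("illegal character: %s\n", yytext); }
%%
int main(void) {
    int tok;
    while ((tok = yylex()) != 0)
        printf("token %d, lexeme '%s', length %d\n", tok, yytext, yyleng);
    return 0;
}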

5. Automata and language theory

An ε transition is a transition that consumes no input.

5.1. Deterministic finite automaton (DFA)

A model that consists of:

  • S, a set of states
  • Σ, the input alphabet
  • A state s ∈ S, the start state
  • F ⊆ S, a set of final states
  • Move, a function giving the transitions between states
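
As a concrete sketch, here is a C simulation of a small DFA that accepts identifiers (a letter followed by letters or digits); the state numbering is illustrative:

#include <ctype.h>
#include <stdio.h>
#include <stdbool.h>

/* States: 0 = start, 1 = final (identifier seen), 2 = dead state. */
static int move(int s, char c) {
    switch (s) {
    case 0:  return isalpha((unsigned char)c) ? 1 : 2;
    case 1:  return isalnum((unsigned char)c) ? 1 : 2;
    default: return 2;
    }
}

/* Run the DFA over the whole input; accept iff we end in a final state. */
static bool accepts(const char *input) {
    int s = 0;                 /* the start state s */
    for (; *input; input++)
        s = move(s, *input);
    return s == 1;             /* F = {1} */
}

int main(void) {
    printf("%d %d\n", accepts("x42"), accepts("42x"));  /* prints: 1 0 */
    return 0;
}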

5.2. Nondeterministic finite automaton (NFA)

A mathematical model that consists of the same things as a DFA, except for the move function.

Basically, a DFA in which the same input symbol may lead to multiple states (e.g., both a self loop and an edge to the next state), and which may contain ε transitions.

Simulating one naively (by backtracking) can be inefficient and can produce exponential running time. To fix this, we can build an equivalent DFA (the subset construction).

6. Syntax analysis

6.1. Parser

Parsing is the process of determining whether a string of tokens can be produced by a grammar.

Most parsing methods take one of two forms:

6.1.1. Top-down parser

Easy to build by hand, but some grammars (e.g., left-recursive ones) are impossible to parse this way.

6.1.2. Bottom-up parser

Harder to build by hand, but can handle more grammars.

7. Parse tree (LL)

In LL (top-down) parsing, parse trees are created from the root towards the leaves, scanning input from left to right (the first L) and producing a leftmost derivation (the second L).

8. Bottom up parsing

The process of reducing the input string to the start symbol of the grammar.

8.1. LR parsers

LR parsers care about what has been seen in the previous input; LL parsers do not, they only care about the current token. We replace input tokens with grammar symbols (nonterminals) where a rule applies.

These are more powerful than LL parsers; for example, they don’t have the left-recursion issue.

8.1.1. Key operations

  • Shift

    Move the next input token onto the stack

  • Reduce

    Identify a rule’s right-hand side on top of the stack and replace it with the rule’s left-hand side

  • Accept

    The stack has been reduced to the start symbol and the input is exhausted

  • Error

    Something went wrong (a syntax error)

8.1.2. Key issues

When the parser has a choice, should it:

  • Shift or reduce (a shift-reduce conflict)? The classic example is the dangling else: after if (E) if (E) S, on seeing else the parser can shift (attaching the else to the inner if) or reduce (closing the inner if).
  • Reduce by one rule or by another (a reduce-reduce conflict)?

8.1.3. States of a parser

  • Closure

    We use a dot as a separator inside an item: behind the dot is what we have already parsed, after the dot is what we have yet to parse. For example, the item E → E . + T means an E has been parsed and + T is still expected. The closure of a state adds the items for every nonterminal that appears right after a dot.

  • Goto

    Goto leads to some other state. It basically moves the dot past the symbol just accepted.

8.2. Shift reduce parsing (stack)

  • Move the first token from the input stream onto the stack
  • Replace token(s) on the stack with some rule “T” if applicable
  • Reduce the stack as much as possible
  • Once reduced as much as possible, move the next input token onto the stack
  • Repeat from step 2 until EOF is reached
  • If possible, reduce
  • You’re done (see the trace below)
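
A worked trace (a sketch) for the toy grammar E → E + T | T, T → id on the input id + id:

Stack       Input       Action
$           id + id $   shift
$ id        + id $      reduce by T → id
$ T         + id $      reduce by E → T
$ E         + id $      shift
$ E +       id $        shift
$ E + id    $           reduce by T → id
$ E + T     $           reduce by E → E + T
$ E         $           accept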

8.3. Parse tree

The parse tree is constructed beginning at the leaves and going towards the root.

9. Compiler data structures

On the semantic-analysis side of things, a compiler generally includes several data structures to simplify and abstract the compilation process.

9.1. Symbol table

A symbol table stores all the symbols encountered within a program. These symbols are generally identifiers (variables, constants).

When a compiler encounters a new identifier, it adds an entry for that identifier into the symbol table. If the compiler encounters an existing identifier, it checks the symbol table for scope, value, etc.

Each entry in the symbol table also generally stores the:

  • Value
  • Data type
  • Scope
  • Number of parameters
  • And other things as needed

A symbol table is often implemented with the use of some data structure, more often than not, a hash map.

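A minimal sketch of such a symbol table in C, implemented as a chained hash map (the struct fields and names are illustrative):

#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define TABLE_SIZE 211

typedef struct Symbol {
    char *name;            /* the identifier's text */
    char *type;            /* data type, e.g. "int" */
    int scope;             /* scope depth */
    struct Symbol *next;   /* chaining for hash collisions */
} Symbol;

static Symbol *table[TABLE_SIZE];

static unsigned hash(const char *s) {
    unsigned h = 0;
    while (*s)
        h = h * 31 + (unsigned char)*s++;
    return h % TABLE_SIZE;
}

/* Return the existing entry, or insert a new one on first encounter. */
static Symbol *lookup_or_insert(const char *name, const char *type, int scope) {
    unsigned h = hash(name);
    for (Symbol *sym = table[h]; sym; sym = sym->next)
        if (strcmp(sym->name, name) == 0 && sym->scope == scope)
            return sym;                  /* already in the table */
    Symbol *sym = malloc(sizeof *sym);   /* new identifier: insert it */
    sym->name = strdup(name);
    sym->type = strdup(type);
    sym->scope = scope;
    sym->next = table[h];
    table[h] = sym;
    return sym;
}

int main(void) {
    lookup_or_insert("count", "int", 0);             /* first encounter: insert */
    Symbol *s = lookup_or_insert("count", "int", 0); /* later encounter: found */
    printf("%s: %s (scope %d)\n", s->name, s->type, s->scope);
    return 0;
}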

9.2. String table

A string table is used to store strings that are found in the code. If the same string is found in multiple places within the code, instead of storing multiple distinct copies of the string, it can be stored once as an indexed entry within a string array.

A string table can be a linear array, or can be implemented via a hash map.
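
A minimal interning sketch in C (the linear version; all names are illustrative):

#include <string.h>
#include <stdlib.h>

static char *strings[1024];   /* the string table */
static int count = 0;

/* Return the index of s, inserting it if it is not yet present. */
int intern(const char *s) {
    for (int i = 0; i < count; i++)
        if (strcmp(strings[i], s) == 0)
            return i;          /* already stored: reuse the index */
    strings[count] = strdup(s);
    return count++;
}

Every later use of the string is then just a small integer index.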

9.3. Symbol table vs String table

The symbol table and string table are fairly similar concepts. The difference between them can be a bit confusing.

The primary function of a string table is to optimize the storage and loading of data. This leads to the system running faster and using less memory. When an identifier is encountered in the source code, its text is stored in the string table (if not already present). The compiler uses the string table to handle the textual aspect of the identifier efficiently.

A symbol table contains much more information than this. It contains information about each identifier in the table as described above. The same identifier also has an entry in the symbol table, but with additional contextual information. The symbol table entry links to the string table for the identifier’s name but also includes information like its type, scope, and usage within the program.

9.4. Abstract syntax tree

An abstract syntax tree is arguably the most complicated structure of these three. The semantics of a program can be represented by an abstract syntax tree. The tree is abstract in the sense that it omits certain elements, like parentheses or semicolons, which are not necessary for understanding the program’s structure.

An abstract syntax tree is made out of two parts:

  • Nodes

    Nodes can represent “points” in source code. A node can store literals (constants), identifiers (variables), or more complex things like expressions, statements, and control structures (loops, conditionals).

  • Edges

The edges simply link the nodes together, turning an otherwise disconnected pool of nodes into a readable data structure (see the sketch below).
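
A minimal sketch of such nodes in C, for arithmetic expressions (the node kinds are illustrative):

/* Each node is a tagged union; the child pointers are the edges. */
typedef enum { NODE_NUM, NODE_VAR, NODE_BINOP } NodeKind;

typedef struct Node {
    NodeKind kind;
    union {
        int value;                      /* NODE_NUM: a literal constant */
        char name[32];                  /* NODE_VAR: an identifier */
        struct {                        /* NODE_BINOP: e.g. '+' or '*' */
            char op;
            struct Node *left, *right;  /* edges to the child nodes */
        } binop;
    } u;
} Node;

For example, a + 2 becomes a NODE_BINOP ’+’ whose children are a NODE_VAR and a NODE_NUM; the parentheses in (a + 2) leave no trace in the tree.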

Author: Jay

Created: 2025-04-15 Tue 16:07