programming_under_the_hood/FilesCh.xml at master · johnnyb/programming_under_the_hood · GitHub

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
<chapter id="filesch">
<title>Dealing with Files</title>
<!--

Copyright 2002 Jonathan Bartlett

Permission is granted to copy, distribute and/or modify this
document under the terms of the GNU Free Documentation License,
Version 1.1 or any later version published by the Free Software
Foundation; with no Invariant Sections, with no Front-Cover Texts,
and with no Back-Cover Texts.  A copy of the license is included in fdl.xml

-->

<para>
A lot of computer programming deals with files<indexterm><primary>files</primary></indexterm>.  After all, when we reboot our
computers, the only thing that remains from previous sessions are
the things that have been put on disk.  Data which is stored in
files is called <emphasis>persistent<indexterm><primary>persistance</primary></indexterm></emphasis> data,
because it persists in files that remain on the disk even when the program isn't running..
</para>

<sect1>
<title>The UNIX File Concept</title>

<para>
Each operating system has its own way of dealing with files.  However, the
UNIX method, which is used on Linux, is the simplest and most universal.
UNIX files, no matter what program created them, can all be accessed as a
sequential stream of bytes.  When you access a file, you start by opening it by name.
The operating system then gives you a number, called a
<emphasis>file descriptor<indexterm><primary>file descriptors</primary></indexterm></emphasis>,
which you use to refer to the file until you are
through with it.  You can then read and write to the file using its
file descriptor.  When you are done reading and writing, you then close
the file, which then makes the file descriptor useless.
</para>

<para>
In our programs we will deal with files in the following ways:
</para>

<orderedlist>

<listitem><para>
Tell Linux the name of the file to open, and in what mode you want it
opened (read, write, both read and write, create it if it doesn't exist,
etc.).  This is handled with the
<literal>open<indexterm><primary>open</primary></indexterm></literal>
system call, which takes a filename, a number representing the mode,
and a permission<indexterm><primary>permissions</primary></indexterm> set as its parameters.  &eax-indexed; will hold the system call number, which is 5.
The address of the first character of the filename should be stored
in &ebx-indexed;.  The read/write intentions, represented as
a number, should be stored in &ecx-indexed;.  For now, use 0 for files you want to read
from, and 03101 for files you want to write to
(you must include the leading zero).<footnote><para>This will be explained
in more detail in <xref linkend="truthbinarynumbers" />.</para></footnote>
Finally, the permission set should be stored as a number in &edx-indexed;.
If you are unfamiliar with UNIX permissions, just use 0666 for the
permissions (again, you must include the leading zero).
</para></listitem>

<listitem><para>
Linux will then return to you a file descriptor<indexterm><primary>file descriptors</primary></indexterm> in
&eax-indexed;.  Remember, this is a number that you use to refer to this file
throughout your program.
</para></listitem>

<listitem><para>
Next you will operate on the file doing reads and/or writes, each time
giving Linux the file descriptor you want to use.  <literal>read<indexterm><primary>read</primary></indexterm></literal>
is system call 3, and to call it you need to have the file descriptor
in &ebx;, the address of a buffer for storing the
data that is read in &ecx;, and the size of the buffer in &edx;.
Buffers will be explained in <xref linkend="buffersbss" />.
<literal>read</literal> will return with either the number of
characters read from the file, or an error code.  Error codes can be
distinguished because they are always negative numbers (more information
on negative numbers can be found in <xref linkend="countingchapter" />).
<literal>write<indexterm><primary>write</primary></indexterm></literal>
is system call 4, and it requires the same
parameters as the <literal>read</literal> system call, except that the
buffer should already be filled with the data to write out.  The
<literal>write</literal> system call will give back the number
of bytes written in &eax; or an error code.
</para></listitem>

<listitem><para>
When you are through with your files, you can then tell Linux to close them.
Afterwards, your file descriptor<indexterm><primary>file descriptors</primary></indexterm> is no longer valid.
This is done using <literal>close<indexterm><primary>close</primary></indexterm></literal>, system call 6.  The only
parameter to <literal>close</literal> is the file descriptor, which is
placed in &ebx;
</para></listitem>

</orderedlist>


</sect1>

<sect1 id="buffersbss">
<title>Buffers and <literal>.bss</literal></title>

<para>
In the previous section we mentioned buffers<indexterm><primary>buffers</primary></indexterm> without explaining what they
were.  A buffer is a continuous block of bytes used for bulk data transfer.
When you request to read a file, the operating system needs to have a place
to store the data it reads.  That place is called a buffer.   Usually buffers
are only used to store data temporarily, and it is then read from the buffers
and converted to a form that is easier for the programs to handle.  Our
programs won't be complicated enough to need that done.
For an example, let's say that you want to read in a single line of text from
a file but you do not know how long that line is.  You would then
simply read a large number of bytes/characters from the file into a buffer,
look for the end-of-line character, and copy all of the characters to that
end-of-line character to another location.  If you didn't find an end-of-line
character, you would allocate another buffer and continue reading.
You would probably wind up with some characters left over in your buffer
in this case, which you would use as the starting point when you next need
data from the file.<footnote><para>While this sounds complicated, most of
the time in programming you will not need to deal directly with buffers
and file descriptors.  In <xref linkend="linking" /> you will learn how
to use existing code present in Linux to handle most of the complications
of file input/output for you.</para></footnote>
</para>

<para>
Another thing to note is that buffers<indexterm><primary>buffers</primary></indexterm> are a fixed size, set by the programmer.
So, if you want to read in data 500 bytes at a time, you send the
<literal>read</literal> system call the address of a 500-byte unused location,
and send it the number 500 so it knows how big it is.  You can make it smaller
or bigger, depending on your application's needs.
</para>

<para>
To create a buffer, you need to either reserve static or dynamic storage.
Static storage is what we have talked about so far, storage locations
declared using <literal>.long</literal> or <literal>.byte</literal> directives.
Dynamic storage will be discussed in <xref linkend="dynamicmemory" />.
There are problems, though, with declaring buffers using <literal>.byte<indexterm><primary>.byte</primary></indexterm></literal>.
First, it is tedious to type.  You would have to type 500 numbers after
the <literal>.byte</literal> declaration, and they wouldn't be used
for anything but to take up space.
Second, it uses up space in the executable.  In the examples we've
used so far, it doesn't use up too much, but that can change in larger
programs.  If you want 500 bytes you have to type in 500 numbers and it
wastes 500
bytes in the executable.  There is a solution to both of these.  So far,
we have discussed two program sections, the <literal>.text<indexterm><primary>.text</primary></indexterm></literal> and
the <literal>.data<indexterm><primary>.data</primary></indexterm></literal> sections.  There is another section called
the <literal>.bss<indexterm><primary>.bss</primary></indexterm></literal>.  This section is like the data section, except
that it doesn't take up space in the executable.  This section can
reserve storage, but it can't initialize it.  In the <literal>.data</literal>
section, you could reserve storage and set it to an initial value.  In the
<literal>.bss</literal> section, you can't set an initial value.  This is
useful for buffers because we don't need to initialize them anyway, we
just need to reserve storage.  In order to do this, we do the following
commands:
</para>

<programlisting>
.section .bss
	.lcomm my_buffer, 500
</programlisting>

<para>
This directive, <literal>.lcomm<indexterm><primary>.lcomm</primary></indexterm></literal>, will create a symbol, <literal>my_buffer</literal>, that refers to
a 500-byte storage location that we can use as a buffer.  We can then do
the following, assuming we have opened a file for reading and have placed
the file descriptor in &ebx;:
</para>

<programlisting>
	movl $my_buffer, %ecx
	movl 500, %edx
	movl 3, %eax
	int  $0x80
</programlisting>

<para>
This will read up to 500 bytes into our buffer.  In this example, I placed
a dollar sign in front of <literal>my_buffer</literal>.  Remember that the
reason for this is that without the dollar sign,
<literal>my_buffer</literal> is treated as a memory location, and is
accessed in direct addressing mode<indexterm><primary>direct addressing mode</primary></indexterm>.  The dollar sign switches it to immediate mode addressing,
which actually loads the number represented by <literal>my_buffer</literal>
itself (i.e. - the address of the start of our buffer, which is the address
of <literal>my_buffer</literal>) into &ecx;.
</para>

</sect1>

<sect1>
<title>Standard and Special Files</title>

<para>
You might think that programs start without any files open by default.  This
is not true.  Linux programs usually have at least three open file descriptors<indexterm><primary>file descriptors</primary></indexterm>
when they begin.  They are:
</para>

<variablelist>

<varlistentry>
<term>STDIN<indexterm><primary>STDIN</primary></indexterm></term>
<listitem><para>
This is the <emphasis>standard input<indexterm><primary>standard input</primary></indexterm></emphasis>.  It is a read-only file, and
usually represents your keyboard.<footnote><para>As we mentioned earlier, in Linux, almost everything is a "file".  Your keyboard input is considered a file, and so is your screen display.
</para></footnote>
This is always file descriptor 0.
</para></listitem>
</varlistentry>

<varlistentry>
<term>STDOUT<indexterm><primary>STDOUT</primary></indexterm></term>
<listitem><para>
This is the <emphasis>standard output<indexterm><primary>standard output</primary></indexterm></emphasis>.  It is a write-only file,
and usually represents your screen display.  This is always file descriptor 1.
</para></listitem>
</varlistentry>

<varlistentry>
<term>STDERR<indexterm><primary>STDERR</primary></indexterm></term>
<listitem><para>
This is your <emphasis>standard error<indexterm><primary>standard error</primary></indexterm></emphasis>.  It is a write-only file,
and usually represents your screen display.  Most regular processing
output goes to <literal>STDOUT</literal>, but any error messages that come
up in the process go to <literal>STDERR</literal>.  This way, if you want to,
you can split them up into separate places.  This is always file
descriptor 2.
</para></listitem>
</varlistentry>
</variablelist>

<para>
Any of these "files<indexterm><primary>files</primary></indexterm>" can be redirected from or to a real file, rather
than a screen or a keyboard.  This is outside the scope of this book, but
any good book on the UNIX command-line<indexterm><primary>command-line</primary></indexterm> will describe it in detail.
The program itself does not even need to be aware of this indirection -
it can just use the standard file descriptors<indexterm><primary>file descriptors</primary></indexterm> as usual.
</para>

<para>
Notice that many of the files you write to aren't files at all.  UNIX-based
operating systems treat all input/output systems as files.  Network connections
are treated as files<indexterm><primary>files</primary></indexterm>, your serial port is treated like a file, even your audio
devices are treated as files.  Communication between processes is
usually done through special files<indexterm><primary>special files</primary></indexterm> called pipes<indexterm><primary>pipes</primary></indexterm>.   Some of these files have different methods of opening
and creating them than regular files<indexterm><primary>regular files</primary></indexterm> (i.e. - they don't use the <literal>open</literal> system call), but they can all be read from and
written to using the standard <literal>read</literal> and
<literal>write</literal> system calls.
</para>

</sect1>

<sect1>
<title>Using Files in a Program</title>

<para>
We are going to write a simple program to illustrate these concepts.  The
program will take two files, and read from one, convert all of its
lower-case letters to upper-case, and write to the other file.  Before we
do so, let's think about what we need to do to get the job done:
</para>

<itemizedlist>
<listitem><para>
Have a function that takes a block of memory and converts it to upper-case.
This function would need an address of a block of memory and its size as
parameters.
</para></listitem>

<listitem><para>
Have a section of code that repeatedly reads in to a buffer, calls our
conversion function on the buffer, and then writes the buffer back out
to the other file.
</para></listitem>

<listitem><para>
Begin the program by opening the necessary files.
</para></listitem>
</itemizedlist>

<para>
Notice that I've specified things in reverse order that they will be done.
That's a useful trick in writing complex programs - first decide the meat
of what is being done.  In this case, it's converting blocks of characters
to upper-case.  Then, you think about what all needs to be setup and processed to
get that to happen.  In this case, you have to open files,
and continually read and write
blocks to disk.  One of the keys of programming is continually breaking
down problems into smaller and smaller chunks until it's small enough that
you can easily solve the problem.  Then you can build these chunks back
up until you have a working program.<footnote><para>Maureen Sprankle's
<citetitle>Problem Solving and Programming Concepts</citetitle> is an
excellent book on the problem-solving process applied to computer programming.
</para></footnote>
<!-- FIXME - do I need to introduce flowcharting or reference it here? -->
</para>

<para>
You may have been thinking that you will never remember all of these numbers
being thrown at you - the system call numbers, the interrupt number, etc.
In this program we will also introduce a new directive, <literal>.equ</literal>
which should help out.  <literal>.equ<indexterm><primary>.equ</primary></indexterm></literal> allows you to assign names
to numbers.  For example, if you did
<literal>.equ LINUX_SYSCALL, 0x80<indexterm><primary>0x80</primary></indexterm></literal>, any time after that you wrote
<literal>LINUX_SYSCALL</literal>, the assembler would substitue
<literal>0x80</literal> for that.  So now, you can write
</para>

<programlisting>
int $LINUX_SYSCALL
</programlisting>

<para>
which is much easier to read, and
much easier to remember.  Coding is complex, but there are a lot of things
we can do like this to make it easier.
</para>

<para>
Here is the program.  Note that we have more labels<indexterm><primary>labels</primary></indexterm>
 than we actually use for jumps,
because some of them are just there for clarity.  Try to trace
through the program and see what happens in various cases.  An in-depth
explanation of the program will follow.
</para>

<programlisting>
&toupper-nomm-simplified-s;
</programlisting>

<para>
Type in this program as <filename>toupper.s</filename>, and then enter in
the following commands:
</para>

<programlisting>
as toupper.s -o toupper.o
ld toupper.o -o toupper
</programlisting>

<para>
This builds a program called <filename>toupper</filename>, which converts
all of the lowercase characters in a file to uppercase.
For example, to convert the file <filename>toupper.s</filename> to
uppercase, type in the following command:
</para>

<programlisting>
./toupper toupper.s toupper.uppercase
</programlisting>

<para>
You will now find in the file <filename>toupper.uppercase</filename> an
uppercase version of your original file.
</para>

<para>
Let's examine how the program works.
</para>

<para>
The first section of the program is marked <literal>CONSTANTS</literal>.  In
programming, a constant<indexterm><primary>constants</primary></indexterm> is a value that is assigned when a program assembles or
compiles, and is never changed.  I make a habit of placing all of my constants
together at the beginning of the program.  It's only necessary to declare them
before you use them, but putting them all at the beginning makes them easy to
find.  Making them all upper-case makes it obvious in your program which
values are constants and where to find them.<footnote><para>This is fairly
standard practice among programmers in all languages.</para></footnote>
In assembly language, we
declare constants with
the <literal>.equ<indexterm><primary>.equ</primary></indexterm></literal> directive as mentioned before.  Here, we simply
give names to all of the standard numbers we've used so far, like system call
numbers, the syscall interrupt number, and file open options.
</para>

<para>
The next section is marked <literal>BUFFERS</literal>.  We only use one buffer<indexterm><primary>buffer</primary></indexterm> in
this program, which we call <literal>BUFFER_DATA</literal>.  We also define a
constant, <literal>BUFFER_SIZE</literal>, which holds the size of the buffer.  If
we always refer to this constant rather than typing out the number 500 whenever
we need to use the size of the buffer, if it later changes, we only need to modify
this value, rather than having to go through the entire program and changing all
of the values individually.
</para>

<para>
Instead of going on to the <literal>_start</literal> section of the program, go
to the end where we define the <literal>convert_to_upper</literal> function.
This is the part that actually does the conversion.
</para>

<para>
This section begins with a list of constants that we will use
The reason these are put here rather than at the top is that they only
deal with this one function.  We have these definitions:
</para>

<programlisting>
	.equ  LOWERCASE_A, 'a'
	.equ  LOWERCASE_Z, 'z'
	.equ  UPPER_CONVERSION, 'A' - 'a'
</programlisting>

<para>
The first two simply define the letters that are the boundaries of what we are
searching for.  Remember that in the computer, letters are represented as numbers.
Therefore, we can use <literal>LOWERCASE_A</literal> in comparisons, additions,
subtractions, or anything else we can use numbers in.  Also, notice we define
the constant <literal>UPPER_CONVERSION</literal>.  Since letters are represented
as numbers, we can subtract them.  Subtracting an upper-case letter from the
same lower-case letter gives us how much we need to add to a lower-case letter
to make it upper case.  If that doesn't make sense, look at the
ASCII<indexterm><primary>ASCII</primary></indexterm> code tables
themselves (see <xref linkend="asciilisting" />).  You'll
notice that the number for the character <literal>A</literal> is 65 and the
character <literal>a</literal> is 97.  The conversion factor is then -32.
For any lowercase letter if you add -32, you will get its
capital equivalent.
</para>

<para>
After this, we have some constants labelled <literal>STACK POSITIONS</literal>.
Remember that function parameters<indexterm><primary>function parameters</primary></indexterm>
are pushed onto the stack before function calls.  These constants (prefixed with
<literal>ST</literal> for clarity) define where in the stack we should expect
to find each piece of data.  The return address<indexterm><primary>return address</primary></indexterm> is at position 4 + &esp;, the length of
the buffer is at position 8 + &esp;, and the address of the buffer is at position
12 + &esp;.   Using symbols for these numbers instead of the numbers themselves
makes it easier to see what data is being used and moved.
</para>

<para>
Next comes the label
<literal>convert_to_upper</literal>.  This is the entry point of
the function.  The first two lines are our standard function lines
to save the stack pointer.  The next two lines

<programlisting>
	movl  ST_BUFFER(%ebp), %eax
	movl  ST_BUFFER_LEN(%ebp), %ebx
</programlisting>

move the function parameters into the appropriate registers for use.
Then, we load zero into &edi;.  What we are going to do is iterate
through each byte of the buffer by loading from the location
&eax; + &edi;, incrementing &edi;, and repeating until &edi;
is equal to the buffer length stored in &ebx;.  The lines

<programlisting>
	cmpl  $0, %ebx
	je    end_convert_loop
</programlisting>

are just a sanity check to make sure that noone gave us a buffer of
zero size.  If they did, we just clean up and leave.  Guarding
against potential user and programming errors is an important
task of a programmer.  You can always specify that your function should
not take a buffer of zero size, but it's even better to have the
function check and have a reliable exit plan if it happens.
</para>

<para>
Now we start our loop.  First,
it moves a byte into &cl;.  The code for this is

<programlisting>
	movb  (%eax,%edi,1), %cl
</programlisting>

It is using an indexed indirect addressing mode<indexterm><primary>indexed indirect addressing mode</primary></indexterm>.
It says to start at &eax;
and go &edi; locations forward, with each
location being 1 byte big.  It takes the value found there, and put it in
&cl;.  After this it checks to see if that value is in the range
of lower-case <wordasword>a</wordasword> to lower-case <wordasword>z</wordasword>.
To check the range, it simply checks to see if the letter is smaller than
<wordasword>a</wordasword>.  If it is, it can't be a lower-case letter.  Likewise,
if it is larger than <wordasword>z</wordasword>, it can't be a lower-case letter.
So, in each of these cases, it simply moves on.  If it is in the proper range, it
then adds the uppercase conversion, and stores it back into the buffer.
</para>

<para>
Either way, it then goes to the next value by incrementing %cl;.  Next it checks to
see if we are at the end of the buffer.  If we are not at the end, we jump back to
the beginning of the loop (the <literal>convert_loop</literal> label).  If we are
at the end, it simply continues on to the end of the function.  Because we are
modifying the buffer directly, we don't need to return anything to the calling program - the
changes are already in the buffer.  The label <literal>end_convert_loop</literal>
is not needed, but it's there so it's easy to see where the parts of the program are.
</para>

<para>
Now we know how the conversion process works.  Now we need to figure out how
to get the data in and out of the files.
</para>

<para>
Before reading and writing the files we must open them. The UNIX
<literal>open<indexterm><primary>open</primary></indexterm></literal>
system call is what handles this.  It takes
the following parameters:
</para>

<itemizedlist>
<listitem><para>
&eax-indexed; contains the system call number as usual - 5 in this case.
</para></listitem>
<listitem><para>
&ebx-indexed; contains a pointer to a string that is the name of the file to open.
The string must be terminated with the null character<indexterm><primary>null character</primary></indexterm>.
</para></listitem>
<listitem><para>
&ecx-indexed; contains the options used for opening the file.
These tell Linux how to open the file.  They can indicate things such
as open for reading, open for writing, open for reading and writing, create
if it doesn't exist, delete the file if it already exists, etc.  We will
not go into how to create the numbers for the options until
<xref linkend="truthbinarynumbers" />.  For now, just trust the numbers we come
up with.
</para></listitem>
<listitem><para>
&edx-indexed; contains the permissions
that are used to open the file.
This is used in case the file has to be created first, so Linux knows
what permissions to create the file with.  These are expressed in octal,
just like regular UNIX permissions.<footnote><para>If you aren't familiar
with UNIX permissions, just put <literal>$0666</literal> here.  Don't
forget the leading zero, as it means that the number is an octal<indexterm><primary>octal</primary></indexterm>
number.</para></footnote>
</para></listitem>
</itemizedlist>

<para>
<indexterm><primary>permissions</primary></indexterm>
After making the system call, the file descriptor of the newly-opened file
is stored in &eax-indexed;.
</para>

<para>
So, what files are we opening?  In this example, we will be opening the
files specified on the
command-line<indexterm><primary>command-line</primary></indexterm>.  Fortunately,
command-line parameters
are already stored by Linux
in an easy-to-access location, and are already null-terminated.  When
a Linux program begins, all pointers to command-line arguments are
stored on the stack.  The number of arguments is stored at
<literal>(%esp)</literal>, the name of the program is stored at
<literal>4(%esp)</literal>, and the arguments are stored from
<literal>8(%esp)</literal> on.  In the C Programming language, this
is referred to as the <literal>argv<indexterm><primary>argv</primary></indexterm></literal> array, so we will refer
to it that way in our program.
</para>

<para>
The first thing our program does is save the current stack position
in &ebp; and then reserve some space on the stack to store the file
descriptors.  After this, it starts opening files.
</para>

<para>
The first file the program opens is the input file, which is the first
command-line argument.  We do this by setting up the system call.
We put the file name into &ebx-indexed;, the read-only mode number
into &ecx-indexed;, the default mode of <literal>$0666</literal>
into &edx-indexed;, and the system call number into &eax-indexed;
After the system call, the file is open and the file descriptor
is stored in  &eax-indexed;.<footnote><para>Notice that we don't do any
error checking on this.  That is done just to keep the program simple.
In normal programs, every system call should normally be checked for
success or failure.  In failure cases, &eax; will hold an error
code instead of a return value.  Error codes are negative, so they
can be detected by comparing &eax-indexed; to zero and jumping if it
is less than zero.</para></footnote>  The file
descriptor is then transferred to its appropriate place on the
stack.
</para>

<para>
The same is then done for the output file, except that it is
created with a write-only, create-if-doesn't-exist, truncate-if-does-exist
mode.  Its file descriptor is stored as well.
</para>

<para>
Now we get to the main part - the read/write loop.  Basically, we
will read fixed-size chunks of data from the input file, call our
conversion function on it, and write it back to the output file.
Although we are reading fixed-size chunks, the size of the chunks
don't matter for this program - we are just operating on straight
sequences of characters.  We could read it in with as little or as large of
chunks as we want, and it still would work properly.
</para>

<para>
The first part of the loop is to read the data.  This uses the
<literal>read<indexterm><primary>read</primary></indexterm></literal>
system call.  This call just takes a
file descriptor to read from, a buffer to write into, and the
size of the buffer<indexterm><primary>buffer</primary></indexterm>
(i.e. - the maximum number of bytes that
could be written).  The system call returns the number of bytes
actually read, or end-of-file (the number 0).
</para>

<para>
After reading a block, we check &eax-indexed; for an end-of-file marker.
If found, it exits the loop.  Otherwise we keep on going.
</para>

<para>
After the data is read, the <literal>convert_to_upper</literal> function
is called with the buffer we just read in and the number of characters
read in the previous system call.  After this function executes,
the buffer should be capitalized and ready to write out.  The registers
are then restored with what they had before.
</para>

<para>
Finally, we issue a <literal>write<indexterm><primary>write</primary></indexterm></literal> system call, which is exactly
like the <literal>read</literal> system call, except that it moves the
data from the buffer out to the file.  Now we just go back to the beginning
of the loop.
</para>

<para>
After the loop exits (remember, it exits if, after a read, it detects the
end of the file), it simply closes its file descriptors and exits.  The
close system call just takes the file descriptor to close in &ebx-indexed;.
</para>

<para>
The program is then finished!
</para>

<!-- FIXME - needs to be re-sectionalized and reviewed -->
<!-- FIXME - probably need to start with a "hello world" program -->

</sect1>


<sect1>
<title>Review</title>

<sect2>
<title>Know the Concepts</title>

<itemizedlist>
<listitem><para>Describe the lifecycle of a file descriptor.</para></listitem>
<listitem><para>What are the standard file descriptors and what are they used for?</para></listitem>
<listitem><para>What is a buffer?</para></listitem>
<listitem><para>What is the difference between the <literal>.data</literal> section and the <literal>.bss</literal> section?</para></listitem>
<listitem><para>What are the system calls related to reading and writing files?</para></listitem>
</itemizedlist>

</sect2>

<sect2>
<title>Use the Concepts</title>

<itemizedlist>
<listitem><para>Modify the <literal>toupper</literal> program so that it reads from <literal>STDIN</literal> and writes to <literal>STDOUT</literal> instead of using the files on the command-line.</para></listitem>
<listitem><para>Change the size of the buffer.</para></listitem>
<listitem><para>Rewrite the program so that it uses storage in the <literal>.bss</literal> section rather than the stack to store the file descriptors.</para></listitem>
<listitem><para>Write a program that will create a file called <filename>heynow.txt</filename> and write the words "Hey diddle diddle!" into it.</para></listitem>

</itemizedlist>

</sect2>

<sect2>
<title>Going Further</title>

<itemizedlist>
<listitem><para>What difference does the size of the buffer make?</para></listitem>
<listitem><para>What error results can be returned by each of these system calls?</para></listitem>
<listitem><para>Make the program able to either operate on command-line arguments or use <literal>STDIN</literal> or <literal>STDOUT</literal> based on the number of command-line arguments specified by <literal>ARGC</literal>.</para></listitem>
<listitem><para>Modify the program so that it checks the results of each system call, and prints out an error message to <literal>STDOUT</literal> when it occurs.</para></listitem>
</itemizedlist>

</sect2>

</sect1>

</chapter>