Copyright © 2007-2009 Curt Hill
C Style Strings: A Specialized Form of Array
Introduction
• The discussion of strings is short on new syntax and long on new functions
• This makes it somewhat easier, I hope
Strings are Different
• Most of the things we have dealt with are machine constructs:
– int, double, char, for, while, functions
• They map very nicely to things that machines handle very well
• However, the human-machine interface always has to deal with the notion that people read lines of text
• cout can handle this with \n, but cin has problems: to it, blanks, \n and \t are all just 'whitespace', whereas to us blanks and newlines mean very different things
Storage
• We also have the problem of storage of strings
• Strings are inherently variable length
• When we read in a line of text we may get any number of actual characters
What do we do?
• Historically, there are several main approaches to how we will handle these in memory or as file records
• Fixed Length Records
• Variable Length Records
– Delimiter
– Descriptor
Fixed length records
• Each item is some positive constant in length
– Originally, the most common was 80 characters, the punch card image
• Items cannot be longer
• Shorter items are padded on the right with some character, usually a blank
• This is the FORTRAN approach
Variable length with delimiter
• Delimiter
• There is some special character that says: I am the end of the string or line
• Usually a control character
• This is less general
– Consider object files where any character can legally occur
– Usually there is an escape sequence
Variable length with descriptor
• The descriptor is an explicit length
• There is an integer stored with the string which says how long it is
• Usually immediately before the first character, usually one, two or four bytes
• One byte: the string is at most 255 characters long
• Two bytes: the string is at most 65,535 characters (about 64K)
• Four bytes: the string is at most about 4G characters
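To make the descriptor idea concrete, here is a minimal sketch of a string with a one-byte length in front of its characters. The type ByteCountedString and the function store are hypothetical names invented for this illustration; they are not part of any library.

```cpp
#include <cstddef>
#include <cstring>

// Hypothetical layout: a one-byte descriptor followed by the characters.
struct ByteCountedString {
    unsigned char length;   // the descriptor: how many characters are in use
    char text[255];         // the characters themselves; no \0 is stored
};

// Stores a C style string in descriptor form.
// Returns false if the string is too long for a one-byte descriptor.
bool store(ByteCountedString &dest, const char *src) {
    std::size_t n = std::strlen(src);
    if (n > 255)
        return false;                   // one byte cannot describe more
    dest.length = static_cast<unsigned char>(n);
    std::memcpy(dest.text, src, n);     // copy the characters only
    return true;
}
```

Unlike the delimiter form, any character value, including zero, may appear in text, because the length says where the string ends.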
Storage
• This storage problem is rather vexing from a machine view
• Variable lengths are difficult to allocate on the stack
• We must know the length to access what follows them
• Thus we must allocate a maximum and waste what we do not use
Examples:
• IBM Mainframe systems employ the first two in file systems
• Fixed length files: each record is always the same length
• Card files
– Tape or disk as well
– This is also possible in C++ with just ordinary arrays of characters
– Standard Pascal, FORTRAN and COBOL also use this, among others
Examples (continued)
• IBM Mainframe systems also employed a variable length record, among others
– CBuilder AnsiStrings, among others
– Allocate the maximum number of bytes and then maintain a length indicator
– File systems do not need to allocate maximum but used length only
Delimited variable length
• UNIX, DOS, Windows use this for text files
• CR, LF or CR/LF is the line delimiter
– UNIX and Linux use linefeed
– Windows and DOS use CR/LF
– Each file occupies a whole number of allocation units (sectors or blocks), so the end of the data must be recorded separately: some systems mark it with a character or character string, others store the file length
• C/C++ uses this for strings
– Null character is delimiter
Delimiter Again
• Allocate the maximum amount of memory needed for the string
• Use a byte with a binary zero to mark the end
• This is '\0'
• Nothing after the \0 is considered as valid contents
Discussion
• All of these approaches are a concession to how people do things
• They are not neat and clean compared to other kinds of things, such as integers
• Mostly because of the variable length approach
• The delimiter approach resembles the unfull array technique
– Built in to string libraries
Strings usage
• We have already seen strings
• A constant string is enclosed in double quotes, whereas an ASCII character constant is one character inside apostrophes
• The string "hi" is how many characters?
– Two for hi and one for \0 = 3
Null character
• A byte with a value of zero
– Not the zero digit
• Automatically provided by a double quoted string
• May also be supplied by escape sequence: '\0'
• Initialization:
char str[3] = "Hi";
char c = '\0';
char d = 0;
Common mistake:
• char x[5] = "Hello";
• We do not have room for the \0
– Should be a compile error in C++
– Not detected in CBuilder6
• The absence of the \0 can cause runtime errors that will be noted later
• The \0 is always appended to any string in quotes
Declaration
• Declaration of a string is just the same as declaring an array of characters
• Recall that an array of characters can be handled as a string or any other way consistent with an array of type char
• char str[9] = "Hi there";
• char str[] = "Hi there";
Declaration Again
• char str[10] = "Hi there";
declares str as an array of 10 characters
– Initializes first nine characters from the literal
– First eight as above
– Ninth with \0
– Tenth is also \0, since elements without an initializer are filled with zero
• The only real difference between this and any other array is the shorthand for strings:
– char str[10] = {'H','i',' ','t','h','e','r','e','\0'};
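The equivalence of the two notations can be checked directly. The following is a small sketch; the function names same_initialization and tenth_element are made up for this illustration.

```cpp
#include <cstring>

// Returns true if the string shorthand and the brace list build identical arrays.
bool same_initialization() {
    char a[10] = "Hi there";
    char b[10] = {'H','i',' ','t','h','e','r','e','\0'};
    return std::memcmp(a, b, sizeof a) == 0;   // compares all ten bytes
}

// Returns the tenth element, which has no explicit initializer.
char tenth_element() {
    char str[10] = "Hi there";
    return str[9];
}
```

Both forms give the same ten bytes because a partly initialized array has its remaining elements set to zero.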
String usage
• Most other differences between a string and any other array are found in the standard functions
• First we will consider the fstreams
• Second, two libraries
cout
• cout (and all ofstreams) may handle a string as we have seen
• However, since it does not know the length, it must search for the Null character to terminate
• If there is no Null, it treats the string as longer than it actually is, until it finds a coincidental Null in memory
• The Null is common in memory, since small positive ints usually have three of their four bytes equal to zero
• Nevertheless, it is easy to get tens, hundreds or thousands of extra bytes displayed
cin
• A different story
• In cin whitespace is still skipped
• So if you read the following input into a string with >>
– Hello there
– you will get 6 characters: the Hello plus the Null
• The leading whitespace is skipped and the string is terminated at the blank between the o and t
• Solutions:
– There are three versions of cin.get that will be helpful
• A no parameter version
• A one parameter version
• A two or three parameter version
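The whitespace behavior of >> is easy to see in a small sketch. The function first_word is a made-up name, and it takes any istream so that it can be tried with a std::istringstream standing in for cin.

```cpp
#include <iomanip>
#include <istream>
#include <sstream>   // std::istringstream, handy for trying this without a keyboard
#include <string>

// Reads one item with >> into a character array and returns what was stored.
std::string first_word(std::istream &in) {
    char word[20];
    in >> std::setw(20) >> word;   // setw limits the read to 19 characters plus the \0
    return word;
}
```

Given the input "  Hello there", the function returns "Hello": the leading blanks are skipped and the read stops at the blank before there, which stays in the stream.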
Get
• These are methods of ifstreams (and of cin)
• int get()
– Gets one character and returns it (as an int, so that end of file can be reported)
– Does not skip whitespace
• istream & get(char &)
– Gets one character without whitespace skipping
– Returns a value that we will mostly ignore but that can be used to indicate success
– It is actually a reference to the stream, but we can test it in a condition, where false means unsuccessful
Examples
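• Read all the characters:
char ch[10000];
int i = 0;
int c;
while (i < 9999 && (c = cin.get()) != EOF)   // stop at end of file or when full
    ch[i++] = c;
ch[i] = '\0';
• Also:
char ch[10000];
int i = 0;
while (i < 9999 && cin.get(ch[i]))           // the stream tests false on failure
    i++;
ch[i] = '\0';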
String Get
• istream & get(char p[], int n, char delim = '\n')
• The initial argument is a string to read the characters into
• n is the maximum number of characters to obtain
• Since this form always terminates strings with a \0, the maximum number of input characters is only n-1
• Hence cin.get(st,1) only loads the \0
String Get
• The third parameter is a terminator character
• This can be anything, though the default is an excellent choice
• The get will read characters and store them in p until one of the following conditions is met:
– Too many characters
– Delimiter is found
• When we are done, if the delimiter was found it will be the next unread character
– Hence it will never read a delimiter
Getline
• istream & getline(char p[], int n, char delim = '\n')
• Essentially the same as three parameter get except it eats the delimiter and does not copy it to the buffer
• This is my favorite
Examples
• Declaration:
char line[MAX];
• This will read the line but leave the end of line in the input buffer:
cin.get(line, MAX);
• This will read the line and discard the end of line:
cin.getline(line, MAX);
• A comma delimited file might be read:
cin.getline(line, MAX, ',');
• No good way to read when there are two or more different delimiters
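The difference between get and getline shows up in what is left unread. This sketch uses two made-up helper functions and an assumed MAX of 80; each reads a line and then peeks at the next character without removing it.

```cpp
#include <istream>
#include <sstream>   // std::istringstream can stand in for cin when trying this

const int MAX = 80;  // assumed buffer size for this sketch

// Reads a line with get and reports the next unread character.
int next_after_get(std::istream &in, char line[]) {
    in.get(line, MAX);       // stops before the '\n' and leaves it in the stream
    return in.peek();
}

// Reads a line with getline and reports the next unread character.
int next_after_getline(std::istream &in, char line[]) {
    in.getline(line, MAX);   // removes the '\n' but does not store it
    return in.peek();
}
```

With the input "one\ntwo\n", both calls store "one" in line, but next_after_get reports '\n' while next_after_getline reports 't'. A loop that calls get repeatedly without removing the '\n' will therefore make no progress after the first line.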
String assignment
• Given char a[10], b[10];
• Can we:
– a = b;
• No
• Can we:
– a = "Hi there";
• No
• How then do we do string assignment?
• Like any array manipulation
The Hard Way
• Usually by function call or something involving a for loop
• Like all arrays the following is possible:
char a[10], b[10];
for (int i = 0; i < 10; i++)
    a[i] = b[i];
• Or we can define a function to do the same thing:
void str_asgn(char target[], const char src[], int size);
str_asgn
void str_asgn(char target[], const char src[], int size) {
    for (int i = 0; i < size; i++) {
        target[i] = src[i];      // copy one character
        if (target[i] == 0)      // stop once the \0 has been copied
            break;
    }
}
Overlapping Arrays
• One of the problems with this function is that overlapping arguments will cause weird results
• For example– str_asgn(&a[1],a,10);
• However, it uses next to no memory
• What actually happens?
Overlap
• Suppose the following array:
char a[5] = "hi";
• And we call:
str_asgn(&a[1], a, 4);
– The size is 4 because only four elements remain from a[1] on
• Then a[0] is copied to a[1]
– This is the 'h' which now occupies the first two characters
• Next a[1] is copied to a[2]
– This is the 'h' which now occupies the first three characters
Third Copy
[Figure: the array after the third copy, with source starting at a[0] and target at a[1]; contents: h h h h *]
• Copy source[2] to target[2]
• You see the pattern
• Handy if this is what you want
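The overlap effect can be reproduced with a short sketch. str_asgn is repeated from the earlier slide so the block stands alone; smear and shift_right are made-up names, and the size of 4 is chosen here so that the copy stays inside a five element array.

```cpp
#include <cstring>

// The copy loop from the str_asgn slide, repeated so this sketch is self-contained.
void str_asgn(char target[], const char src[], int size) {
    for (int i = 0; i < size; i++) {
        target[i] = src[i];
        if (target[i] == 0)
            break;
    }
}

// Overlapping copy with the loop: every element becomes a[0] and the \0 is lost.
void smear(char a[5]) {
    str_asgn(&a[1], a, 4);
}

// The same one-place shift done with memmove, which tolerates overlap.
void shift_right(char a[5]) {
    std::memmove(&a[1], a, 3);   // moves 'h', 'i' and the \0 up one place
}
```

Starting from "hi", smear leaves five 'h' characters and no \0, so the array must not be printed as a string afterwards. shift_right leaves "hhi", which is usually what was intended.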
String operations
• What can we do to an integer (assume int i,j;)?
• Many things
– Comparison: if(i<j)
– Arithmetic: i*j-2
– Assignment: i=j;
• What can we do to two arrays (assume int x[5],y[5];)?
– Next to nothing without resorting to a function
• Should we consider a string an elementary type or a structured type (in this case an array)?
Structured Types
• Clearly C/C++ thinks of strings as arrays so we can do next to nothing
• We cannot assign one string to another
• It seems like we can do nothing to strings other than write functions that manipulate them or use existing functions that do
• Fortunately most of the useful functions have already been written
Utility string functions
• The first library to consider is string.h
• Inside this are some utility functions that help us to perform string manipulation
• Some of these we will consider and many others not
strlen
• size_t strlen(const char *source)
• Takes a string as an argument and finds the length of the string
• Not physical length but the position of the \0 character
• It is the length of the usable string and the subscript of the \0 character
• Extremely handy
• It may overrun the array
– If the \0 is missing, it may give a logical length greater than the physical length
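The difference between logical and physical length can be seen by comparing strlen with sizeof. The three function names below are invented for this sketch.

```cpp
#include <cstddef>
#include <cstring>

// Logical length: the number of characters before the \0.
std::size_t logical_length() {
    char str[10] = "Hi there";
    return std::strlen(str);
}

// Physical length: the number of bytes the array occupies.
std::size_t physical_length() {
    char str[10] = "Hi there";
    return sizeof str;
}

// A literal's size includes the \0 that the compiler appends.
std::size_t literal_size() {
    return sizeof "hi";
}
```

These return 8, 10 and 3 respectively. Note that sizeof gives the physical size only where the array declaration is visible; inside a function that receives the string as a parameter it would give the size of a pointer.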
memcpy
• The two mem functions are not string functions but array functions
• void *memcpy(void *s, const void *ct, size_t n)
• Copy n bytes from ct to s
• Return pointer to s
memmove
• Same as memcpy except works if operands overlap
• Moves (copies really) length characters from source to dest.
• Often folds into one machine language instruction
• Does not care about \0, is guided only by length
Example
• The mem functions can be used for gross array movement of any sort
• For example:
int a[10], b[10];
...
memcpy(a, b, 10 * sizeof(int));
– sizeof is an operator that takes an expression or parenthesized type
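Wrapped in a function, the same call might look like the sketch below. The name copy_ints is made up for this illustration.

```cpp
#include <cstring>

// Copies count ints from src to dest; the two arrays must not overlap.
void copy_ints(int dest[], const int src[], int count) {
    // memcpy counts bytes, so the element count is scaled by the element size.
    std::memcpy(dest, src, count * sizeof(int));
}
```

The count is passed in because sizeof dest inside the function would measure a pointer, not the caller's array.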
Characteristics
• String functions have a number of characteristics making them easier to remember
• They all start with str
– Usually followed by three or four letters
– This is descriptive
• The first parameter is usually a string and the most important one
– Only one to be changed
strcpy
• char * strcpy(char s[], const char ct[])
• Copy ct to s, including the \0
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0
Two Flavors
• Almost all string functions come in two flavors
– Brave and bold
– Cautious
• The brave version always believes that a null character will be found
• The cautious version takes an additional integer which is the maximum length
– Always has an n in the name right after the str
strncpy
• char * strncpy(char s[], const char ct[],int n)
• Copy ct to s, including the \0, or at most n characters, whichever comes first
– If n characters are copied before the \0 is reached, no \0 is stored in s
• The return value is the pointer to s
• No overlap is allowed
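Because strncpy stores no \0 when the source is too long, a common precaution is to add one by hand. The following is a sketch of that idea; safe_copy is a made-up name, and capacity is assumed to be at least 1.

```cpp
#include <cstring>

// Copies src into dest, which holds capacity bytes, and always leaves a valid string.
void safe_copy(char dest[], const char src[], int capacity) {
    std::strncpy(dest, src, capacity - 1);  // copy at most capacity-1 characters
    dest[capacity - 1] = '\0';              // strncpy adds no \0 if src was too long
}
```

With a five byte destination, copying "Hello there" leaves "Hell", while copying "Hi" leaves "Hi" unchanged.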
strcat
• Short for concatenate
• char * strcat(char s[], const char ct[])
• Copy ct to end of s
– The \0 of s is replaced and the end of the string is supplied from ct
• The return value is the pointer to s
• No overlap is allowed and there had better be a \0
strncat
• char * strncat(char s[], const char ct[], int n);
• Copy ct to end of s
– The \0 of s is replaced and a new \0 is always stored after the copied characters
• Copy at most n characters onto s
• The new length is the sum of the length of s and the copied characters
• The return value is the pointer to s
• No overlap is allowed
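Since the n in strncat limits the characters appended rather than the total size, the room that remains has to be computed first. This is a sketch under the assumption that dest already holds a valid string that fits in capacity bytes; safe_append is a made-up name.

```cpp
#include <cstddef>
#include <cstring>

// Appends src to the string in dest, which holds capacity bytes, without overflowing.
void safe_append(char dest[], const char src[], std::size_t capacity) {
    std::size_t used = std::strlen(dest);          // characters already present
    std::strncat(dest, src, capacity - used - 1);  // leave one byte for the \0
}
```

With char s[10] = "Hi", appending " there" gives "Hi there". Appending "!!!!" after that adds only one '!', because only one free byte remains besides the \0.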
Recall
• All these functions are straight from the C library
• Standard in every implementation of C/C++ since the 70s
• C had no bool until the 90s, so comparisons return an int
• Also functions that return a character will actually return an int
– This will automatically be converted to char
strcmp
• Comparison
• int strcmp(const char s[], const char t[])
• Compare s to t
• Returns
– <0 if s<t
– 0 if s==t
– >0 if s>t
• Neither string is changed, but there had better be a \0 in each
Comparing characters
• When two integers are compared, the whole integer participates
• String comparison is somewhat different
• We sequentially compare corresponding characters
• The result is the result of comparing the first pair that is different
• A leading substring is always less than the longer string that starts with it
• Character comparison is based on the collating sequence
Example
• Compare two strings:
"bbbazz"
"bbbbaa"
• First string is less
• Compare two strings:
"zzz"
"zzza"
• The shorter is less than the longer
• "Z" < "a" in ASCII
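These comparisons can be tried with strcmp. Since only the sign of its result is defined, the sketch below reduces the result to -1, 0 or +1; compare is a made-up name.

```cpp
#include <cstring>

// Reduces strcmp's result to -1, 0 or +1 so that it is easy to inspect.
int compare(const char s[], const char t[]) {
    int r = std::strcmp(s, t);
    return (r > 0) - (r < 0);
}
```

compare("bbbazz", "bbbbaa") and compare("zzz", "zzza") both give -1, and compare("hi", "hi") gives 0. compare("Z", "a") gives -1 assuming an ASCII collating sequence.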
strncmp
• int strncmp(const char s[], const char t[],int n)
• Compare first n characters of s and t
• Returns
– <0 if s<t
– 0 if s==t
– >0 if s>t
• Comparison also stops at a \0 found before n characters
strchr
• char * strchr(const char s[], int c)
• Looks for first c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0
strrchr
• char * strrchr(const char s[], int c)
– Nearly the same but starts at right side
• Looks for last c in s
• Returns the pointer to the character if found and NULL otherwise
• There had better be a \0
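Both search functions return a pointer, and subtracting the start of the string turns that pointer into a subscript. A sketch, with made-up function names:

```cpp
#include <cstring>

// Returns the subscript of the first c in s, or -1 if c does not occur.
int first_index(const char s[], char c) {
    const char *p = std::strchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}

// Returns the subscript of the last c in s, or -1 if c does not occur.
int last_index(const char s[], char c) {
    const char *p = std::strrchr(s, c);
    return p ? static_cast<int>(p - s) : -1;
}
```

For "Hi there" and 'e', first_index gives 5 and last_index gives 7. For a character that is absent, both give -1.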
Many others
• There are many others here as well that are less important:
– strspn
– strcspn
– strpbrk
– strstr
– strerror
– strtok
– memcmp
– memchr
– memset
Utility character functions
• Another library of importance is ctype.h
• These are functions that do something with a single character
– Classifies
– Converts case
isalpha
• int isalpha(int c);
• Is the character c a letter (upper or lower)?
• Returns 0 for false and a nonzero value (not necessarily 1) for true
isupper and islower
• int isupper(int c);
• Is c an upper case letter?
• int islower(int c);
• Is c a lower case letter?
More
• int isdigit(int c);
– Is c a digit
• int isalnum(int c);
– Is c a letter or digit
• int iscntrl(int c);
– Is c a control character
• int isspace(int c);
– Is c white space (blank, tab, newline...)
More
• int isprint(int c);
• Is c printable (printables and space)
• int ispunct(int c);
• Is c a printing character except space, letters or digits
• int isxdigit(int c);
• Is c a digit in hexadecimal (0-9, A-F, a-f)
• int isgraph(int c);
• Is c a graphic character (printing except space)
Conversion
• int tolower(int c);
• Convert c to lower case
• If !(isupper(c)) then c is returned
• int toupper(int c);
• Convert c to upper case
• If !(islower(c)) then c is returned
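The classification and conversion functions combine naturally in a loop over a string. The sketch below uses a made-up function, upper_in_place; the conversion to unsigned char is a precaution, since these functions expect a character value that is not negative.

```cpp
#include <cctype>

// Converts the string to upper case in place and returns how many letters it held.
int upper_in_place(char s[]) {
    int letters = 0;
    for (int i = 0; s[i] != '\0'; i++) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if (std::isalpha(c))                        // nonzero means "is a letter"
            letters++;
        s[i] = static_cast<char>(std::toupper(c));  // non-letters come back unchanged
    }
    return letters;
}
```

Applied to "Hi there 42", it changes the array to "HI THERE 42" and returns 7.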
Advantages
• Strings have several privileges over any other array
• Easy constant array notation
– May be used other than in declarations
• Integrated unfull array scheme
String Objects
• Despite these advantages the string objects are the better approach
• They allow easy assignment and comparison
• Their methods provide all the extra things needed
• Strings were good for C, but object use is the C++ way
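For contrast, here is a brief sketch of the same kinds of operations with the standard string class. The function names greet and comes_before are invented for this illustration.

```cpp
#include <string>

// Assignment and concatenation use ordinary operators.
std::string greet(const std::string &name) {
    std::string a = "Hi ";
    std::string b;
    b = a;        // assignment copies the whole string
    b += name;    // the string grows as needed; no maximum is declared
    return b;
}

// Comparison uses the relational operators instead of strcmp.
bool comes_before(const std::string &s, const std::string &t) {
    return s < t;
}
```

greet("there") gives "Hi there", and comes_before("zzz", "zzza") is true, matching the strcmp rule that a leading substring is less than the longer string.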