Monday, April 21, 2008

Default Address Selection Part 1

If you are familiar with IPv6, you will know that default address selection is an important concept. It is defined in RFC 3484. Because of space constraints, I have split this topic into two parts. The first part is an introduction and shows how to use the feature. The second part will explain the kernel/glibc internals of the implementation; I hope to write it soon. It is advisable to read the RFC before proceeding.

To give a brief idea of what default address selection is, take a host that has multiple IPv6 addresses and must decide which one to use for communication. Any communication needs a source address and a destination address, and the problem arises when there are several of each to choose from. IPv6 allows a host to configure multiple addresses by default, so an algorithm is needed to sort the candidates. We can broadly classify default address selection into two types:

1) Default source address selection
2) Default destination address selection

Say host A wants to communicate with host B (which may be an internal or an external system). A needs to know the destination IP of host B. To get it, A sends a DNS query to the configured DNS server and uses the response as the destination IP. What if the DNS reply contains multiple IPs for the same domain name? That is where destination address selection comes into the picture. Once the destination IP has "somehow" been selected, we need to select an appropriate source IP. Why is that necessary? Can't we just pick the first IP from the list of source IPs and start communicating? The answer is no, because an IPv6 address can be link-local, site-local or global. If the destination is a global address and the first source address in the list is link-local, the communication cannot happen because of the scope mismatch. So we need an intelligent algorithm to select the correct source IP.
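To make the scope mismatch concrete, here is a minimal sketch in C that classifies an address as link-local, site-local or global and checks whether a source is wide enough for a destination. The names scope_of and scope_covers are made up for this illustration, and the check covers only the scope comparison, not the full RFC 3484 rule set.

```c
#include <arpa/inet.h>
#include <netinet/in.h>

/* Unicast scopes, ordered from narrowest to widest (illustrative names). */
enum addr_scope { SCOPE_LINK = 2, SCOPE_SITE = 5, SCOPE_GLOBAL = 14 };

/* Classify a textual IPv6 unicast address; returns -1 if it does not parse. */
static int scope_of(const char *text)
{
    struct in6_addr a;

    if (inet_pton(AF_INET6, text, &a) != 1)
        return -1;
    /* RFC 3484 treats the loopback address as link-local scope. */
    if (IN6_IS_ADDR_LINKLOCAL(&a) || IN6_IS_ADDR_LOOPBACK(&a))
        return SCOPE_LINK;
    if (IN6_IS_ADDR_SITELOCAL(&a))
        return SCOPE_SITE;
    return SCOPE_GLOBAL;
}

/* Hypothetical helper: nonzero if the source scope is at least as wide
 * as the destination scope, i.e. there is no scope mismatch. */
static int scope_covers(const char *src, const char *dst)
{
    int s = scope_of(src);
    int d = scope_of(dst);

    return s >= 0 && d >= 0 && s >= d;
}
```

With this, scope_covers("fe80::2", "2001::1") is 0 (link-local source, global destination), while scope_covers("2001::2", "2001::1") is 1.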

Another interesting aspect of destination address selection is deciding which IP to use if the DNS query returns both an IPv4 and an IPv6 address. Some factor has to decide this selection. More on all this in Part 2 :-). So, we have a situation where these addresses are selected based on certain criteria. By default the criteria are as per the RFC. For most users this should hold good, but what if it needs to be changed? Let's say that by default IPv6 addresses are given more precedence than IPv4, but the administrator wants IPv4 to have the higher precedence. In these cases there needs to be a way to configure source address selection and destination address selection. For this reason the RFC defines user configuration tables for source and destination selection.

Before we go into the configuration tables, let's look at a basic fact. Source address selection is implemented in the kernel and destination address selection in glibc. Wonder why? The reason is very simple. Glibc implements the name resolution APIs, gethostbyname and family (getaddrinfo among them), which trigger the DNS query. So these APIs obviously get all the replies as well, and it makes sense to implement the algorithm there.

Let's look at the user configuration tables for both source address selection and destination address selection. There is an interesting article on the subject by the glibc maintainer, Ulrich Drepper. You can find the article here.

Basic Requirements:

- Linux Kernel 2.6.24 or higher
- iproute2 utilities compiled for 2.6.24 (check that "ip help" lists 'addrlabel')

Once we have the prerequisites we are good to go.

User Configuration Table For Source Address Selection:
[root@t6018ab-009124035140 ip]# ./ip addrlabel show
prefix ::1/128 label 0
prefix ::/96 label 3
prefix ::ffff:0.0.0.0/96 label 4
prefix 2001::/32 label 6
prefix 2002::/16 label 2
prefix fc00::/7 label 5
prefix ::/0 label 1

This is the default source address user configuration table. The "label" field is the important part of the table. A label is not a priority: when the kernel compares candidate source addresses, it prefers a candidate whose label matches the label of the destination address (RFC 3484 source address selection, rule 6). For example, the prefix ::1/128 has label 0, so for the destination ::1 only a source that also maps to label 0 counts as a label match.
Let's say we add two host entries with a label of their own
prefix 2003:470:1f00:ffff::4/128 label 8
prefix 2003:470:1f00:ffff::5/128 label 8
Source Address Selection List:
2003:470:1f00:ffff::4
2003:470:1f00:ffff::5
2003:470:1f00:ffff::6

Destination Address
2003:470:1f00:ffff::7

Irrespective of the order of the source address list, 2003:470:1f00:ffff::6 will be selected as the source. The other two addresses have a label value of 8, whereas both 2003:470:1f00:ffff::6 and the destination 2003:470:1f00:ffff::7 fall under the rule "prefix ::/0 label 1", so ::6 is the only candidate whose label matches the destination's. We can play around with giving different label values to different prefixes. Since source address selection works in conjunction with destination address selection, we shall look at testing this aspect a little later.
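The label lookup can be sketched in C as a longest-prefix match over the table, followed by the RFC 3484 "prefer matching label" rule (source selection rule 6). This is only an illustration and not the kernel code: the names label_of and pick_by_label are made up, the two label-8 entries are assumed to be host entries (/128), and every other selection rule is ignored.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stddef.h>
#include <string.h>

struct label_entry {
    const char *prefix;   /* textual IPv6 prefix */
    int         len;      /* prefix length in bits */
    int         label;
};

/* Default table from "ip addrlabel show" plus two assumed /128 entries. */
static const struct label_entry label_table[] = {
    { "::1", 128, 0 }, { "::", 96, 3 }, { "::ffff:0.0.0.0", 96, 4 },
    { "2001::", 32, 6 }, { "2002::", 16, 2 }, { "fc00::", 7, 5 },
    { "::", 0, 1 },
    { "2003:470:1f00:ffff::4", 128, 8 },
    { "2003:470:1f00:ffff::5", 128, 8 },
};

/* Return 1 if the first `bits` bits of a and b are equal. */
static int prefix_match(const struct in6_addr *a, const struct in6_addr *b,
                        int bits)
{
    int bytes = bits / 8;
    int rem = bits % 8;
    unsigned char mask;

    if (memcmp(a->s6_addr, b->s6_addr, (size_t)bytes) != 0)
        return 0;
    if (rem == 0)
        return 1;
    mask = (unsigned char)(0xff << (8 - rem));
    return (a->s6_addr[bytes] & mask) == (b->s6_addr[bytes] & mask);
}

/* Label of the longest matching prefix; -1 if the address does not parse. */
static int label_of(const char *text)
{
    struct in6_addr addr, pfx;
    int best_len = -1, best_label = -1;
    size_t i;

    if (inet_pton(AF_INET6, text, &addr) != 1)
        return -1;
    for (i = 0; i < sizeof(label_table) / sizeof(label_table[0]); i++) {
        const struct label_entry *e = &label_table[i];

        if (inet_pton(AF_INET6, e->prefix, &pfx) != 1)
            continue;
        if (e->len > best_len && prefix_match(&addr, &pfx, e->len)) {
            best_len = e->len;
            best_label = e->label;
        }
    }
    return best_label;
}

/* Rule 6 only: first candidate whose label equals the destination's label,
 * or NULL if none matches. */
static const char *pick_by_label(const char *const *srcs, size_t n,
                                 const char *dst)
{
    int want = label_of(dst);
    size_t i;

    if (want < 0)
        return NULL;
    for (i = 0; i < n; i++)
        if (label_of(srcs[i]) == want)
            return srcs[i];
    return NULL;
}
```

For the three candidates listed above and the destination 2003:470:1f00:ffff::7, pick_by_label returns 2003:470:1f00:ffff::6, whatever the order of the candidate list.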

User Configuration Table For Destination Address Selection:

The destination address user configuration table lives in a configuration file called gai.conf, which belongs in /etc/. Distros do not place the file there, for a certain reason; for more information please read the article by Ulrich Drepper mentioned above. On my system the gai.conf file is located at /usr/share/doc/glibc-common-2.6/gai.conf. This file must be copied to /etc/ if you intend to change the default behavior.

A typical gai.conf file



# label
# Add another rule to the RFC 3484 label table. See section 2.1 in
# RFC 3484. The default is:
#
#label ::1/128 0
#label ::/0 1
#label 2002::/16 2
#label ::/96 3
#label ::ffff:0:0/96 4
#label fec0::/10 5
#label fc00::/7 6

#

# precedence
# Add another rule to the RFC 3484 precedence table. See section 2.1
# and 10.3 in RFC 3484. The default is:
#
#precedence ::1/128 50
#precedence ::/0 40
#precedence 2002::/16 30
#precedence ::/96 20
#precedence ::ffff:0:0/96 10
#
# For sites which prefer IPv4 connections change the last line to
#
#precedence ::ffff:0:0/96 100


For destination address selection, the two main criteria to consider are label and precedence. Remember that precedence is associated with destination address selection only, whereas label is common to both source and destination address selection. It is for this reason that both tables must remain in sync for correct results.
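As a rough illustration of how precedence affects the result, the C sketch below assigns each destination the precedence from the default table shown in gai.conf and sorts the list so that higher precedence comes first. It is not the glibc implementation: the names precedence_of and sort_by_precedence are made up, the prefixes are hard-coded instead of being read from gai.conf, and the rules that glibc applies before precedence (reachability, scope, label matching) are left out.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdlib.h>

struct dest {
    const char *text;        /* IPv6 or IPv4-mapped (::ffff:a.b.c.d) address */
    int         precedence;  /* filled in by sort_by_precedence() */
};

/* Precedence from the default table; v4_prec is the value configured for
 * ::ffff:0:0/96 (10 by default, 100 for sites that prefer IPv4). */
static int precedence_of(const char *text, int v4_prec)
{
    struct in6_addr a;

    if (inet_pton(AF_INET6, text, &a) != 1)
        return -1;
    if (IN6_IS_ADDR_LOOPBACK(&a))
        return 50;                        /* ::1/128 */
    if (IN6_IS_ADDR_V4MAPPED(&a))
        return v4_prec;                   /* ::ffff:0:0/96 */
    if (IN6_IS_ADDR_V4COMPAT(&a))
        return 20;                        /* ::/96 (macro excludes :: itself) */
    if (a.s6_addr[0] == 0x20 && a.s6_addr[1] == 0x02)
        return 30;                        /* 2002::/16 */
    return 40;                            /* ::/0 */
}

/* qsort comparator: higher precedence sorts first. */
static int by_precedence_desc(const void *x, const void *y)
{
    const struct dest *a = x;
    const struct dest *b = y;

    return (b->precedence > a->precedence) - (b->precedence < a->precedence);
}

static void sort_by_precedence(struct dest *d, size_t n, int v4_prec)
{
    size_t i;

    for (i = 0; i < n; i++)
        d[i].precedence = precedence_of(d[i].text, v4_prec);
    qsort(d, n, sizeof(*d), by_precedence_desc);
}
```

Sorting the pair ::ffff:192.0.2.1 and 2001:db8::1 with v4_prec set to 10 puts the IPv6 address first; with v4_prec set to 100, as in the last line of the gai.conf excerpt, the IPv4-mapped address moves to the front.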

Testing Destination Address Selection

To test the destination address selection algorithm we need a small program. The best test cases are the examples given in RFC 3484, section 10.2.

A few requirements:

- Add an entry "multi on" to /etc/host.conf
- Stop the name service caching daemon (service nscd stop)
- Compile the program given below (it prints the result of default address selection)


#include <errno.h>
#include <error.h>
#include <netdb.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

char buf[INET6_ADDRSTRLEN];

int
main(int argc, char *argv[])
{
    int err;
    struct addrinfo *ai;
    struct addrinfo hints;
    struct addrinfo *runp;

    if (argc < 2)
        error(EXIT_FAILURE, 0, "usage: %s hostname", argv[0]);

    memset(&hints, '\0', sizeof(hints));
    hints.ai_protocol = IPPROTO_TCP;

    // dummy gethostbyname call so that /etc/host.conf is read
    gethostbyname(argv[1]);

    // getaddrinfo returns the list already sorted by the
    // destination address selection rules
    err = getaddrinfo(argv[1], "", &hints, &ai);
    if (err != 0)
        error(EXIT_FAILURE, 0, "getaddrinfo(%d): %s", err,
              gai_strerror(err));

    // print every address in the order it was returned
    for (runp = ai; runp != NULL; runp = runp->ai_next) {
        err = getnameinfo(runp->ai_addr, runp->ai_addrlen, buf,
                          sizeof(buf), NULL, 0, NI_NUMERICHOST);
        if (err != 0)
            error(EXIT_FAILURE, 0, "getnameinfo(%d): %s", err,
                  gai_strerror(err));

        printf("family:%2d socktype:%2d protocol:%3d addr:%s(%d)\n",
               runp->ai_family, runp->ai_socktype, runp->ai_protocol,
               buf, (int) runp->ai_addrlen);
    }

    freeaddrinfo(ai);
    return 0;
}




Example taken from section 10.2 of the RFC:
Candidate Source Addresses: 2001::2 or fec0::2 or fe80::2
Destination Address List: 2001::1 or fec0::1 or fe80::1
Result: fe80::1 (src fe80::2) then fec0::1 (src fec0::2) then 2001::1 (src 2001::2) (prefer smaller scope)

Destination address selection will be demonstrated using this example from the RFC.
The first step is to add multiple DNS entries to the DNS server. That is a big process, so I will use the /etc/hosts file to keep things simple (it behaves much like DNS server replies).

So add the following to /etc/hosts:
fec0::1 rockon
2001::1 rockon
fe80::1 rockon

Add the source addresses to the interface:
#ip -6 addr add 2001::2 dev eth0
Do the same for fec0::2 and fe80::2.

Next, make sure every destination address added to /etc/hosts has a valid route entry; otherwise the above will not work.
For example, fec0::1 is a destination IP, so the algorithm will choose it only if there is a valid route to it.
#ip -6 route add fec0::1 dev eth0
Similarly, add routes for the other destination candidates.

To execute the program
#./a.out rockon
family:10 socktype: 1 protocol: 6 addr:fe80::1(28)
family:10 socktype: 1 protocol: 6 addr:fec0::1(28)
family:10 socktype: 1 protocol: 6 addr:2001::1(28)

The result shows the order in which the destination addresses are sorted. The rest of the RFC examples can be tried out the same way, and the destination user configuration table (gai.conf) can be modified to see different results.


Testing Source Address Selection:


Let's look at how to test the source address selection functionality. The best way to do so is to follow the test cases specified in RFC 3484, section 10.1.
For testing source address selection, use ping6.

Example taken from section 10.1 of the RFC
Destination: 2001::1
Candidate Source Addresses: 3ffe::1 or fe80::1
Result: 3ffe::1 (prefer appropriate scope)

Configure the IPv6 address on interface eth0:
#ip -6 addr add 3ffe::1 dev eth0
fe80::1 can be ignored, as the interface already has a link-local address by default.
Add a valid route to the destination IP:
#ip -6 route add 2001::1 dev eth0
#ping6 2001::1

The result will be "destination unreachable" if 2001::1 doesn't exist, but that is not our concern. The unreachable message shows which source address was selected. This is how one can test the source address selection algorithm. Try out all the different examples given in the RFC. Now, by tweaking the user configuration table as described in "User Configuration Table For Source Address Selection", we can modify the behavior.
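If you would rather not read the source address out of the ping6 error message, a common alternative is to connect() a UDP socket to the destination and ask the kernel which local address it picked with getsockname(). The sketch below assumes that approach; the function name selected_source is made up for this example, nothing is actually sent on the wire, and a link-local destination would additionally need a scope id, which is not handled here.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Write the source address the kernel selects for dst into buf.
 * Returns 0 on success, -1 on failure (bad address, no route, ...). */
static int selected_source(const char *dst, char *buf, socklen_t buflen)
{
    struct sockaddr_in6 peer, local;
    socklen_t len = sizeof(local);
    int rc = -1;
    int fd;

    memset(&peer, 0, sizeof(peer));
    peer.sin6_family = AF_INET6;
    peer.sin6_port = htons(9);            /* any port will do, nothing is sent */
    if (inet_pton(AF_INET6, dst, &peer.sin6_addr) != 1)
        return -1;

    fd = socket(AF_INET6, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    /* connect() on a UDP socket only resolves the route and source address. */
    if (connect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0 &&
        getsockname(fd, (struct sockaddr *)&local, &len) == 0 &&
        inet_ntop(AF_INET6, &local.sin6_addr, buf, buflen) != NULL)
        rc = 0;

    close(fd);
    return rc;
}
```

Calling selected_source("2001::1", buf, sizeof(buf)) with the setup above and printing buf should show the same address that appears in the ping6 output, which makes it easy to script the RFC test cases.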

I hope this little write-up was useful in understanding how address selection works. In Part 2 I will explain how the algorithm works.

Update: Thanks to Brandon for pointing out a mistake in the post. Check comments for details.

Thursday, April 17, 2008

Division Between Users And Kernel Hackers On Git Bisect

The source code management tool git has come under the scanner again, this time for a different reason. Flame wars are pretty common in the Linux community. Every time opinion is divided on something, it unleashes a flurry of mails from the community gurus, and this time was no different. It all started when Mark Lord reported a regression in the network stack. Even after a few mail exchanges it was not clear what the cause of the problem was, so the netdev gurus asked Mark to "bisect" and arrive at the culprit patch. Mark responded furiously, saying that he had neither the time nor the inclination to do such a thing. He argued that he was only a bug reporter and that bisecting was not his job. This triggered the whole question of who does what. Heated arguments were exchanged over mail, along with a few humorous stories to support the claims; no one can forget the "Doctor Patient" story. To sum up, the main focus of the whole episode was who is responsible for tracking down such regressions.

Let's look briefly at git bisect. Git bisect is used to find the likely cause of a problem. It works on a simple principle: the bug hunter has to know one kernel that works well and one that has the bug. For example, if 2.6.24 has no problem but 2.6.25 does, bisect picks a version somewhere in the middle of these two releases. The bug hunter tests that version and, depending on whether the bug shows up, git bisect is run again on the half that must contain the offending change. To continue the example, suppose the test showed that 2.6.25-rc4 had the issue. It is then clear that the bug was introduced somewhere between 2.6.24 and 2.6.25-rc4. Running git bisect again narrows the range further, and this continues until it comes down to a particular commit. This helps identify the bug, but the process is time consuming. To lay the facts down plain and simple, git bisect does not necessarily arrive at the patch that actually causes the problem. There is a possibility that the problem was created elsewhere and only came to light when this "culprit" patch was introduced. So this is not a sure-shot way to identify the problem.

In this case, Mark states that a user who identifies a problem need only report it, and that is where his or her duty ends; doing more homework to help the developers fix the bug is up to the individual user. The developers argue that a user will be asked to bisect only as a last resort. Forcing users to do more than just report a bug can cause them to stop reporting bugs, which is not a good thing for the community. On the other hand, well-known kernel hackers like David Miller claim that it is sometimes unavoidable, because the developers do not have the hardware the user is running, and that this requires the user to cooperate in the effort. As one can see, both sides have a strong point, and it is difficult to take sides here. This can become a major issue if not resolved quickly. Al Viro and James Morris suggested that subsystem maintainers need to be more careful when committing patches, which could avoid most regressions.

Another question that was discussed was how a user decides which bugs are "real". In code as complex as the kernel, bugs can arise from faulty hardware that nobody else has. When that happens it is virtually impossible to fix, and the bug remains unfixed. It only comes to light if multiple users complain of the same problem. The community urged its users to test properly before posting bugs to the mailing list. The story has not concluded yet, as there is no clear solution to this problem. It remains to be seen how the community will tackle this issue.
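The bisection idea itself is just a binary search, and a small C sketch shows why it needs so few test builds. This is not how git implements it: the sketch assumes a strictly linear history numbered from a known-good commit to a known-bad one, assumes the bug stays present once introduced, and uses made-up names (first_bad_commit, bad_from) for illustration.

```c
#include <stddef.h>

/* Caller-supplied test: nonzero if the build at `commit` shows the bug. */
typedef int (*is_bad_fn)(size_t commit, void *ctx);

/* Commits are numbered in history order. `good` is known to work and `bad`
 * is known to fail. Returns the first failing commit and counts the number
 * of builds that had to be tested in *steps. */
static size_t first_bad_commit(size_t good, size_t bad, is_bad_fn is_bad,
                               void *ctx, size_t *steps)
{
    *steps = 0;
    while (bad - good > 1) {
        size_t mid = good + (bad - good) / 2;

        if (is_bad(mid, ctx))
            bad = mid;       /* bug already present: search the older half */
        else
            good = mid;      /* still fine: search the newer half */
        (*steps)++;
    }
    return bad;
}

/* Example predicate: the bug exists from commit number *ctx onwards. */
static int bad_from(size_t commit, void *ctx)
{
    return commit >= *(size_t *)ctx;
}
```

With 1000 commits between the good and the bad kernel, the loop needs at most 10 test builds. Each of those builds still has to be compiled, booted and tested by hand, which is where the time goes.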