Erasure coding in Xrootd

Erasure coding 101 and architecture in xrootd

Erasure coding 101

Reed–Solomon error correction provides the basis for Erasure Coding (EC). The basic idea is that for n bits of data X = (x1, x2, ..., xn), one can compute m bits of parity P = (p1, p2, ..., pm) using an m x n matrix S so that

S X^T = P^T

If up to m of the n + m bits (data or parity) are missing, the missing bits can be recovered using the above relation. There are several ways to choose the matrix S; those details are beyond the scope of this document.
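
For intuition, a minimal example (an illustration only, not how XrdEC actually chooses S): with m = 1 and S = (1 1 ... 1) over GF(2), the single parity bit is p1 = x1 xor x2 xor ... xor xn, and any one missing bit, data or parity, can be recomputed by xor-ing the remaining n bits. Reed–Solomon generalizes this to larger m by working over a larger finite field.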

Xrootd uses Intel’s Intelligent Storage Acceleration Library (ISA-L) for erasure coding. The EC functionality is implemented in xrootd's client library, XrdCl. On each xrootd server, the data chunks that belong to the same file are stored in a single xrootd ZIP archive, using the same file name.

Architecture of using EC in xrootd

Because the EC function is implemented in the xrootd client library, a vanilla xrootd cluster is not aware of EC at all; it simply provides storage for EC-enabled xrootd clients. The xrootd cluster needs at least n + m data servers.

There are two ways to use EC in an xrootd cluster environment:

  1. Using up-to-date xrootd client tools such as xrdcp, xrdfs and xrootdfs. Because up-to-date xrootd clients are required, this method is usually only suitable for administrators on a LAN.
  2. Putting the xrootd cluster behind an xrootd proxy (DTN). The XrdCl component (and its plugin XrdEC) in the proxy handles the EC. This is the preferred way, as it works with any xrootd or HTTP client. The proxy also supports WLCG TPC.

This document assumes that administrators will use EC-enabled xrdcp/xrdfs/xrootdfs for tasks such as restoring data after a drive failure, moving and deleting data, etc. Users will access data through the xrootd proxy.

The nature of EC means that it is better to treat EC in xrootd as an object store rather than POSIX storage. Though some POSIX I/O functions (especially reading) are still supported via the xroot protocol, it is better to think in terms of object store functionality: upload, download, stat, dirlist, rename, deletion, plus WLCG TPC.

Erasure coding configuration in xrootd

The backend xrootd storage cluster

Since the backend xrootd cluster is not aware of EC, there is no extra EC-related xrootd configuration beyond setting up a plain xrootd cluster using a relatively new xrootd release (e.g. xrootd 5.3.0+). There are a few things to consider:

  1. The xrootd cluster needs at least n + m data servers.
  2. The file systems used by the xrootd storage cluster need to support extended attributes.
  3. EC already provides data redundancy, which reduces the need for RAID. If RAID is not used, one can use xrootd's oss.space directive to make individual file systems available in xrootd. This approach also makes it easier to identify damaged files after a hardware failure (see the disk failure section below).

Enabling EC in xrootd clients

Xrootd clients that directly face the backend storage need to load the XrdEC library. These clients include the administrative tools xrdcp/xrdfs/xrootdfs and the user-facing xrootd proxy. To enable EC in xrootd clients:

  1. use an xrootd release that includes libXrdEC.so (xrootd 5.4.1 or later).
  2. create a file /etc/xrootd/client.plugins.d/xrdec.conf and put the following in this file:
url = root://*
lib = XrdEcDefault
enable = true
nbdta = 8
nbprt = 2
chsz = 1048576

Here nbdta is the number of data chunks in an EC stripe, nbprt is the number of parity chunks in the stripe, and chsz is the chunk size in bytes.

The extension of the file must be .conf.
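
For a sense of the cost with the values above: nbdta = 8, nbprt = 2 and chsz = 1048576 mean that each stripe carries 8 MiB of data plus 2 MiB of parity (a 25% storage overhead), and a file stays readable as long as no more than 2 of the 10 chunks in any stripe are lost.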

Tips

  1. Even with XrdEC configured, you may want to access an individual xrootd data server directly to check something. In that case, use the xroot://... protocol, because the config file above only loads XrdEC for the root://... protocol (see the example after this list).
  2. If you cannot add the above config file, you can set the Unix environment variable XRD_PLUGINCONFDIR to point to the directory that holds this file.
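
A minimal sketch of both tips (the host names and paths below are placeholders, not part of any real setup):

# Inspect a single data server without triggering XrdEC (note xroot:// rather than root://):
xrdfs xroot://dataserver01.example.org:1094 ls -l /data/dir1

# Point the client at an alternative plugin config directory:
export XRD_PLUGINCONFDIR=$HOME/xrd-plugins
xrdfs root://my_redirector stat /data/dir1/file1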

Configuring an xrootd proxy using EC

In addition to the requirements in the Enabling EC in xrootd clients section, please refer to the WLCG TPC example configuration document for the xrootd proxy (DTN) configuration. For EC to work (and perform well), a few minor changes to the configuration are needed:

  1. Xrootd client's default parameters are tuned with WAN use in mind. Since the xrootd proxy and the EC storage will likely be on the same LAN, we suggest adding the following configuration lines to the end:
setenv XRD_TIMEOUTRESOLUTION = 15
setenv XRD_CONNECTIONWINDOW = 15
setenv XRD_CONNECTIONRETRY = 2
setenv XRD_REQUESTTIMEOUT = 120
setenv XRD_STREAMERRORWINDOW = 0

  2. On the proxy server, an external checksum script and an xrootd TPC script are needed (examples below). To use an external checksum script, use the following directive in the xrootd configuration file.
xrootd.chksum adler32 /etc/xrootd/xrdadler32.sh
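
The TPC script itself is hooked in through the ofs.tpc directive. The exact options are site-specific, so the line below is only a sketch to be adapted from the WLCG TPC example configuration document:

ofs.tpc autorm pgm /etc/xrootd/xrdcp-tpc.sh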

An example of /etc/xrootd/xrdadler32.sh

#!/bin/bash

# Make sure XrdEC is loaded, and there is no security issue.
# Otherwise this script will not give the correct result.

cksmtype="adler32"
file=$1

# For testing purpose: manually supply XRDXROOTD_PROXY (format: host:port)
# Note: when invoked by the xrootd proxy, XRDXROOTD_PROXY is already defined.
[ -z "$2" ] || export XRDXROOTD_PROXY=$2

# If this is an xrootdfs path, derive XRDXROOTD_PROXY from its xroot.url extended attribute.
if [ -z "$XRDXROOTD_PROXY" ]; then
    file=$(getfattr -n xroot.url --only-value $1 2>/dev/null)

    # too bad, this is not an xrootdfs path
    [[ $file = root://* ]] || exit 1

    XRDXROOTD_PROXY=$(echo $file | sed -e 's+^root://++g' | awk -F\/ '{print $1}')
    file=$(echo $file | sed -e "s+^root://++g; s+^$XRDXROOTD_PROXY++g")
fi

hostlist=""
strpver=0
for h in $(xrdfs $XRDXROOTD_PROXY locate -h "*" | awk '{print $1}'); do
    ver=$(xrdfs $h xattr $file get xrdec.strpver 2>/dev/null | tail -1 | awk -F\" '{print $2}')
    [ -z "$ver" ] && continue          # this host does not have the strpver, or does not have the file
    [ $ver -lt $strpver ] && continue  # this host has an older strpver, ignore! 
    if [ $ver -eq $strpver ]; then
        hostlist="$hostlist $h"
    else 
        strpver=$ver
        hostlist="$h"
        cksm=$(xrdfs $h xattr $file get xrdec.${cksmtype} 2>/dev/null | tail -1 | awk -F\" '{print $2}')
    fi  
done

if [ -z "$cksm" ]; then
    cksm=$(xrdcp -C ${cksmtype}:print -s -f root://$XRDXROOTD_PROXY/$file /dev/null 2>&1 | \
         head -1 | \
         awk '{printf("%8s",$2)}' | \
         sed -e 's/\ /0/g')
    for h in $hostlist; do
        xrdfs $h xattr $file set xrdec.${cksmtype}=$cksm 
    done
fi
echo $cksm

Do not forget to chmod +x /etc/xrootd/xrdadler32.sh. This script can also be used by xrootdfs to verify checksums in an XrdEC storage.
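
For a quick manual test of the script (the host and file paths below are placeholders; the XrdEC client config from earlier must be in place for the fallback checksum calculation to be correct):

# against the backend cluster directly, supplying host:port as the second argument
/etc/xrootd/xrdadler32.sh /data/dir1/file1 my_redirector:1094

# on an xrootdfs mount, where the proxy is discovered from the xroot.url extended attribute
/etc/xrootd/xrdadler32.sh /mnt/xrootdfs/dir1/file1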

An example of /etc/xrootd/xrdcp-tpc.sh

#!/bin/bash
# Parse the -S (number of TCP streams) option; a default "-S 1" is prepended so that
# TCPstreamOpts stays empty unless more than one stream is requested.
set -- `getopt S: -S 1 $*`
while [ $# -gt 0 ]
do
  case $1 in
  -S)
      ((nstreams=$2-1))
      [ $nstreams -ge 1 ] && TCPstreamOpts="-S $nstreams"
      shift 2
      ;;
  --)
      shift
      break
      ;;
  esac
done

src=$1
dst=$2

# Pull the source file (xrootd TPC helper mode) and pipe it into the local proxy so
# that XrdEC stripes the data across the backend servers.
xrdcp --server -s $TCPstreamOpts $src - | xrdcp -s - root://$XRDXROOTD_PROXY/$dst

Do not forget to chmod +x /etc/xrootd/xrdcp-tpc.sh.

Administration of Xrootd EC

Avoid launching a DoS attack

XrdEC creates TCP connections to all backend server nodes at the same time. When doing this with xrdcp or xrdfs in a script that processes many files, a large number of TCP connections will be created and then destroyed in a short period of time. Some network switches may view this as a DoS attack. This is not a problem with the xrootd proxy or xrootdfs, as they maintain and reuse network connections.

Patch or upgrade the EC cluster

If a restart is needed after patching, or if a new xrootd version needs to be deployed, follow the procedure below:

  1. Restarting a backend redirector or frontend proxy redirector will cause new connections to stall for a while until the redirector is back. There is no other impact.
  2. To restart a backend data server, first stop the cmsd service running on the node. Wait until there is no activity in the xrootd service (e.g. no TPC connections), then stop the xrootd service. Bring both services back afterward (see the sketch after this list).
    • At most m data servers can be brought offline at the same time.
    • The more data servers are brought offline at the same time, the less protection there is during that period.
    • If an xrootdfs instance mounts the backend cluster, avoid using it during this period. At a minimum, send a SIGUSR1 to the xrootdfs process after bringing down, and again after bringing up, the cmsd on the data servers.
  3. Similarly, one can use the above procedure for the proxy servers in the proxy cluster.
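
A rough sketch of the per-data-server steps, assuming the stock systemd units cmsd@clustered and xrootd@clustered (instance names and the xrootd port vary from site to site):

# take the node out of the cluster first
systemctl stop cmsd@clustered
# wait until no client/TPC connections remain (1094 is the default xrootd port)
ss -tn state established '( sport = :1094 )'
systemctl stop xrootd@clustered
# ... patch or upgrade ...
systemctl start xrootd@clustered cmsd@clustered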

What to expect when failures happen

When a data server goes down, all files with a corresponding zip file on that server will be affected to some degree. When a disk (a single disk or a RAID volume) fails, all files on that disk will be affected to some degree. Based on our observations, the impact on reads and writes differs:

  • For users reading from the failed component, expect a stall of length XRD_CONNECTIONWINDOW * (XRD_CONNECTIONRETRY - 1).
  • For users writing to the failed component, the write may continue for a while but will eventually fail, with a “connection refused” error during the writing stage or during the close stage (and, rarely, during the open stage).
  • In case of a server failure, new writes initiated within T seconds after the outage (opened before the failure, but starting to write after it) may fail if the failed server is used for writing, where T = max(XRD_CONNECTIONWINDOW * (XRD_CONNECTIONRETRY - 1), XRD_STREAMERRORWINDOW). See the worked example after this list.
    • Writes after that period will continue as long as there are at least n + m data servers.
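
With the LAN settings suggested earlier (XRD_CONNECTIONWINDOW = 15, XRD_CONNECTIONRETRY = 2, XRD_STREAMERRORWINDOW = 0), the read-side stall is 15 * (2 - 1) = 15 seconds, and T = max(15 * (2 - 1), 0) = 15 seconds.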

How to identify and repair damaged files

A damaged file here refers to a file that has lost a zip file due to a disk or data server failure. Note that if EC is configured as n + m, then each server hosts at most one of those n + m zip files. That zip file is stored on one of the file systems/disks on that server.

A damaged file has a degraded level of protection, but it is still available because there are still n + m - 1 corresponding zip files.

Identify damaged files due to a disk failure

Let's first look at the scenario when a disk fails. Suppose a server has:

  • two disks /dev/sda and /dev/sdb mounted at /disk/sda and /disk/sdb
  • an xrootd configuration file that will likely contain the following lines:
all.export /data
oss.space public /disk/sda
oss.space public /disk/sdb

In this case, /data hosts the name space of all (zip) files stored on this server. Under /data is the full directory tree, containing symlinks. The symlinks point to the actual files in /disk/sda and /disk/sdb. For example:

/data/dir1/file1 -> /disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%

If /dev/sdb fails, one can identify all files under /data that are symlinks pointing into /disk/sdb. These files will need to be repaired.
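
A minimal sketch using GNU find, with the paths from the example above:

# list name-space entries whose zip file lived on the failed disk
find /data -type l -lname '/disk/sdb/*' > /tmp/files_to_repair.txt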

If /data is lost, we can recover the name space under /data, because each file in /disk/sda and /disk/sdb stores in its extended attributes the information about where it belongs in the name space:

$ getfattr -n user.XrdFrm.Pfn /disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%
getfattr: Removing leading '/' from absolute path names
# file: disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%
user.XrdFrm.Pfn="/data/dir1/file1"

By going through all files in /disk/sda and /disk/sdb, we can reconstruct the directory tree under /data.
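
A rough sketch of such a rebuild, assuming the two-level hash directory layout shown in the symlink example above (test on a small directory first):

# recreate the /data symlink tree from the user.XrdFrm.Pfn extended attributes
for f in /disk/sda/public/*/* /disk/sdb/public/*/*; do
    pfn=$(getfattr --only-values -n user.XrdFrm.Pfn "$f" 2>/dev/null)
    [ -z "$pfn" ] && continue           # skip files without the attribute
    mkdir -p "$(dirname "$pfn")"
    ln -sf "$f" "$pfn"
done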

Identify damaged files due to a server failure

A server failure should not automatically trigger the data repair procedure, because operation can continue without the failed server, and if the server failure is not due to a disk failure, no data are lost.

In the rare case where the server and its disks are all lost, one will need to check all files on the other servers to see which files are missing a zip file.
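
One simple (if slow) way to do this, assuming an 8 + 2 layout, a hand-maintained list of the surviving data servers and a list of all file paths; every name below is a placeholder:

servers="srv01:1094 srv02:1094 srv03:1094"    # the surviving data servers
while read -r f; do
    n=0
    for s in $servers; do
        # every data server stores its share of the file under the same path
        xrdfs "$s" stat "$f" >/dev/null 2>&1 && n=$((n+1))
    done
    # with an 8 + 2 configuration a healthy file has 10 zip files
    [ "$n" -lt 10 ] && echo "$f"
done < all_files.txt > damaged_files.txt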

Repair damaged files

With a list of files to be repaired in hand, one can now start the repair procedure. This procedure can be summarized as the following steps:

  1. identify the files that need to be repaired
  2. copy each file to a new name e.g. myfile.new
  3. (optionally) compare the checksum of the old and new files
  4. delete the old file
  5. rename the new file to the old name

There are many ways and tools that can be used for this procedure. The following describes how to use xrootdfs for the repair.

  • mount the EC storage via xrootdfs on a host, e.g. xrootdfs /data -o nworkers=20 -o rdr=root://my_redirector//data -o direct_io

    • Make sure that the XrdEC configuration file from the Enabling EC in xrootd clients section is available. The best way to check is to see whether you get the correct size of a file in the xrootdfs-mounted directory tree.
    • If you have 20 data servers, set nworkers to 20 or a little more.
  • Copy the file. One can use any unix command for copying, for example: dd if=myfile of=myfile.new bs=1M iflag=direct oflag=direct. Copy tools that use direct I/O for both input and output and a large block size will perform better.

  • (optional) validate the checksum. Usually existing files already have some kind of checksum calculated and stored. For example, if adler32 is used, one can run /etc/xrootd/xrdadler32.sh myfile and /etc/xrootd/xrdadler32.sh myfile.new to compare the checksums.
  • use 'rm' and 'mv' to delete the old file and rename the new file.

The following script automates the above steps. It takes the path of a file to be repaired (on the xrootdfs mount) and repairs it in place:

$ cat repair.sh
#!/bin/sh

file=$1
CKSM="/etc/xrootd/xrdadler32.sh"  # this is the checksum script mentioned above
nfile="${file}.newcopy"
dd if=$file of=$nfile bs=1M iflag=direct oflag=direct >/dev/null 2>&1
srccksm=$($CKSM $file)
dstcksm=$($CKSM $nfile)
if [ "$srccksm" == "$dstcksm" ]; then
  echo [O] $file
  rm $file
  mv $nfile $file
  exit
else
  echo [X] $file
  rm $nfile
  exit 1
fi
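
The script can then be driven from a damaged-file list (paths on the xrootdfs mount), e.g.:

while read -r f; do sh repair.sh "$f"; done < damaged_files.txt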

Identify debris left behind

This happens when a file is deleted while a data server is offline: the zip file on that offline server is not removed, and if the file is later copied back, the stale zip file is left behind. XrdEC will automatically ignore these "debris" zip files (they carry an older stripe version).

XrdEC SHOULD print them out for cleaning: more on this later

Identify new files for backup

XrdEC SHOULD log all newly created files: more on this later