# Erasure coding in Xrootd

## Erasure coding 101 and architecture in xrootd

### Erasure coding 101
The Reed–Solomon error correction code provides the basis for Erasure Coding (EC). The basic idea is that for n bits of data X = (x1, x2, ..., xn), one can compute m bits of parity P = (p1, p2, ..., pm) using an m x n matrix S so that

    S X^T = P^T

If up to m bits in X are missing, those missing bits can be calculated using the above formula. There are several ways to choose the matrix S; those details are beyond the scope of this document.
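As a toy illustration of the formula above (this is for intuition only, not the matrix an actual EC library uses), take n = 2 data bits and m = 1 parity bit, with arithmetic modulo 2:

```latex
% Toy example: n = 2, m = 1, arithmetic over GF(2) (XOR parity).
S = \begin{pmatrix} 1 & 1 \end{pmatrix}, \qquad
S X^{T} = x_1 + x_2 = p_1 \pmod{2}
```

If, say, x2 is lost, it is recovered from the surviving bit and the parity as x2 = p1 + x1 (mod 2). Real EC deployments use larger n and m with matrices chosen so that any m losses are recoverable.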
Xrootd uses Intel’s Intelligent Storage Acceleration Library (ISA-L) for erasure coding. The EC functionality is implemented in xrootd's client library, XrdCl. On each xrootd server, the data chunks that belong to the same file are stored in a single xrootd ZIP archive, using the same file name.
### Architecture of using EC in xrootd
Because the EC function is implemented in the xrootd client library, a vanilla xrootd cluster isn't aware of EC at all. It simply provides storage for EC-enabled xrootd clients to store data in. The xrootd cluster needs at least **n + m** data servers.
There are two ways to use EC in an xrootd cluster environment:

- Using up-to-date xrootd client tools such as `xrdcp`, `xrdfs` and `xrootdfs`. Because up-to-date xrootd clients are required, this method is usually only suitable for administrators on a LAN.
- Putting the xrootd cluster behind an xrootd proxy (DTN). The XrdCl component (and its plugin XrdEC) in the proxy handles the EC. This is the preferred way, as it works with any xrootd or HTTP client. The proxy also supports WLCG TPC.
This document assumes that administrators will use EC-enabled `xrdcp`/`xrdfs`/`xrootdfs` for tasks such as restoring data after a drive failure, moving and deleting data, etc. Users will use the xrootd proxy to access data.
The nature of EC means that it is better to treat EC in xrootd as an object store rather than POSIX storage. Though some POSIX IO functions (especially reading) are still supported via the xroot protocol, it is better to think in terms of object store functionality: upload, download, stat, dirlist, renaming, deletion, plus WLCG TPC.
## Erasure coding configuration in xrootd

### The backend xrootd storage cluster
Since the backend xrootd cluster isn't aware of EC, there is no extra, EC-related xrootd configuration beyond setting up a plain xrootd cluster using a relatively new xrootd release (e.g. xrootd 5.3.0+). There are a few things to consider:
- The xrootd cluster needs at least **n + m** data servers.
- The file systems used by the xrootd storage cluster need to support extended attributes.
- EC already provides data redundancy, which reduces the need for RAID. If RAID is not used, one can use xrootd's `oss.space` directive to make individual file systems available in xrootd. This method provides an extra benefit when identifying files damaged by hardware failure.
### Enabling EC in xrootd clients
Xrootd clients that directly face the backend storage will need to load the XrdEC library. These clients include the administrative tools `xrdcp`/`xrdfs`/`xrootdfs` and the user-facing xrootd proxy. To enable EC in xrootd clients:
- Use an xrootd release that includes `libXrdEC.so` (likely xrootd 5.4.1 or later).
- Create a file `/etc/xrootd/client.plugins.d/xrdec.conf` and put the following in it:

  ```
  url = root://*
  lib = XrdEcDefault
  enable = true
  nbdta = 8
  nbprt = 2
  chsz = 1048576
  ```
Here `nbdta` refers to the number of data chunks in an EC stripe, `nbprt` refers to the number of parity chunks in the stripe, and `chsz` refers to the chunk size in bytes. The extension of the file must be `.conf`.
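With the sample layout above (`nbdta = 8`, `nbprt = 2`), every 8 data chunks get 2 parity chunks; a quick sketch of the resulting space overhead:

```shell
# Space overhead of an n + m EC layout: parity bytes per data byte.
# Values match the sample xrdec.conf above (nbdta = 8, nbprt = 2).
nbdta=8
nbprt=2
overhead=$((100 * nbprt / nbdta))   # extra raw space, in percent
echo "parity overhead: ${overhead}%"
```

So with this layout a 10 GB file consumes roughly 12.5 GB of raw disk across the cluster, while tolerating the loss of any 2 of its 10 chunks.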
#### Tips
- Even with XrdEC configured, you may want to access an individual xrootd data server to check something. In that case, use the `xroot://...` protocol, because the above config file only loads XrdEC for the `root://...` protocol.
- If you can't add the above config file, you can use the environment variable `XRD_PLUGINCONFDIR` to point to the directory that holds this file.
### Configuring an xrootd proxy using EC
In addition to the requirements in the Enabling EC in xrootd clients section, please refer to the WLCG TPC example configuration document for the xrootd proxy (DTN) configuration. For EC to work (and perform well), some minor changes to the configuration are needed:

- Xrootd client's default parameters are tuned with the WAN in mind. Since the xrootd proxy and the EC storage will likely be on the same LAN, we suggest adding the following configuration lines at the end:

  ```
  setenv XRD_TIMEOUTRESOLUTION = 15
  setenv XRD_CONNECTIONWINDOW = 15
  setenv XRD_CONNECTIONRETRY = 2
  setenv XRD_REQUESTTIMEOUT = 120
  setenv XRD_STREAMERRORWINDOW = 0
  ```
- On the proxy server, an external checksum script and an xrootd TPC script are needed (examples below). To use an external checksum script, use the following directive in the xrootd configuration file:

  ```
  xrootd.chksum adler32 /etc/xrootd/xrdadler32.sh
  ```
An example of `/etc/xrootd/xrdadler32.sh` (note the `bash` shebang: the script uses `[[ ... ]]` pattern matching):

```bash
#!/bin/bash
# Make sure XrdEC is loaded, and there is no security issue.
# Otherwise this script will not give the correct result.
cksmtype="adler32"
file=$1
# For testing purposes: manually supply XRDXROOTD_PROXY (format: host:port)
# Note: when used by the xrootd proxy, XRDXROOTD_PROXY is already defined.
[ -z "$2" ] || export XRDXROOTD_PROXY=$2
# If this is a xrootdfs path, try getting XRDXROOTD_PROXY that way.
if [ -z "$XRDXROOTD_PROXY" ]; then
    file=$(getfattr -n xroot.url --only-values $1 2>/dev/null)
    # too bad, this is not a xrootdfs path
    [[ $file = root://* ]] || exit 1
    XRDXROOTD_PROXY=$(echo $file | sed -e 's+^root://++g' | awk -F\/ '{print $1}')
    file=$(echo $file | sed -e "s+^root://++g; s+^$XRDXROOTD_PROXY++g")
fi
hostlist=""
strpver=0
for h in $(xrdfs $XRDXROOTD_PROXY locate -h "*" | awk '{print $1}'); do
    ver=$(xrdfs $h xattr $file get xrdec.strpver 2>/dev/null | tail -1 | awk -F\" '{print $2}')
    [ -z "$ver" ] && continue # this host does not have the strpver, or does not have the file
    [ $ver -lt $strpver ] && continue # this host has an older strpver, ignore!
    if [ $ver -eq $strpver ]; then
        hostlist="$hostlist $h"
    else
        strpver=$ver
        hostlist="$h"
        cksm=$(xrdfs $h xattr $file get xrdec.${cksmtype} 2>/dev/null | tail -1 | awk -F\" '{print $2}')
    fi
done
if [ -z "$cksm" ]; then
    cksm=$(xrdcp -C ${cksmtype}:print -s -f root://$XRDXROOTD_PROXY/$file /dev/null 2>&1 | \
           head -1 | \
           awk '{printf("%8s",$2)}' | \
           sed -e 's/\ /0/g')
    for h in $hostlist; do
        xrdfs $h xattr $file set xrdec.${cksmtype}=$cksm
    done
fi
echo $cksm
```
Do not forget to `chmod +x /etc/xrootd/xrdadler32.sh`. This script can also be used by `xrootdfs` to verify checksums in an XrdEC storage.
An example of `/etc/xrootd/xrdcp-tpc.sh`:

```bash
#!/bin/bash
set -- `getopt S: -S 1 $*`
while [ $# -gt 0 ]
do
    case $1 in
    -S)
        ((nstreams=$2-1))
        [ $nstreams -ge 1 ] && TCPstreamOpts="-S $nstreams"
        shift 2
        ;;
    --)
        shift
        break
        ;;
    esac
done
src=$1
dst=$2
xrdcp --server -s $TCPstreamOpts $src - | xrdcp -s - root://$XRDXROOTD_PROXY/$dst
```
Do not forget to `chmod +x /etc/xrootd/xrdcp-tpc.sh`.
## Administration of Xrootd EC

### Avoid launching a DoS attack
XrdEC creates TCP connections to all backend server nodes at the same time. When doing so using `xrdcp` or `xrdfs` in a script that processes many files, a large number of TCP connections will be created and then destroyed in a short period of time. Some network switches may view this as a DoS attack. This is not a problem with the xrootd proxy or `xrootdfs`, as they maintain and reuse network connections.
### Patch or upgrade the EC cluster

If a restart is needed after patching, or if a new xrootd version is needed, follow the procedure below:
- Restarting a backend redirector or a frontend proxy redirector will cause new connections to freeze for a while until the redirector is back. There is no other impact.
- To restart a backend data server, first stop the `cmsd` service running on the node. Wait until there is no activity in the `xrootd` service (e.g. no TPC connections), then stop the `xrootd` service. Bring both services back afterward.
  - At most `m` data servers can be brought offline at the same time. The more data servers are brought offline at the same time, the less protection during that period.
  - If there is an `xrootdfs` mounting the backend cluster, avoid using the `xrootdfs` instance during this period. At a minimum, send a `SIGUSR1` to the `xrootdfs` process after bringing down and after bringing up the `cmsd` on the data servers.
- Similarly, one can use the above procedure for proxy servers in the proxy cluster.
### What to expect when failures happen

When a data server goes down, all files with corresponding zip files on that server will be affected to some degree. When a disk (real or RAIDed) fails, all files on that disk will be affected to some degree. Based on our observation, the impacts on reads and writes are different:
- For users reading from the failed component, expect a freeze of length `XRD_CONNECTIONWINDOW * (XRD_CONNECTIONRETRY - 1)`.
- For users writing to the failed component, the write may continue but will eventually fail, with a "connection refused" error during the writing stage or during the close stage (and rarely, during the opening stage).
- In case of a server failure, new writes initiated (opened before the failure, but starting to write after the failure) within T seconds of the outage may fail if the failed server is used for writing, where `T = max(XRD_CONNECTIONWINDOW * (XRD_CONNECTIONRETRY - 1), XRD_STREAMERRORWINDOW)`.
  - Writing after that period will continue as long as there are at least **n + m** data servers.
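Plugging in the client settings suggested in the proxy configuration section (`XRD_CONNECTIONWINDOW = 15`, `XRD_CONNECTIONRETRY = 2`, `XRD_STREAMERRORWINDOW = 0`), the expected windows can be computed directly:

```shell
# Freeze/failure windows implied by the suggested client settings above.
XRD_CONNECTIONWINDOW=15
XRD_CONNECTIONRETRY=2
XRD_STREAMERRORWINDOW=0
freeze=$((XRD_CONNECTIONWINDOW * (XRD_CONNECTIONRETRY - 1)))   # read freeze, seconds
T=$freeze                                                      # write-risk window
[ "$XRD_STREAMERRORWINDOW" -gt "$T" ] && T=$XRD_STREAMERRORWINDOW
echo "read freeze: ${freeze}s, write-risk window T: ${T}s"
```

With these values, both the read freeze and the write-risk window T come out to 15 seconds; the default (WAN-oriented) client settings would make both considerably longer.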
### How to identify and repair damaged files
A damaged file here refers to a file that lost a zip file due to a disk or data server failure. Note that if EC is configured as **n + m**, then each server will host at most one of those n + m zip files. This zip file is stored on one of the file systems/disks on that server.

A damaged file has a degraded level of protection, but it is still available because there are still n + m - 1 corresponding zip files.
#### Identify damaged files due to a disk failure
Let's first look at the scenario when a disk fails. Suppose a server has two disks `/dev/sda` and `/dev/sdb` mounted at `/disk/sda` and `/disk/sdb`. The likely xrootd configuration file will have the following lines:

```
all.export /data
oss.space public /disk/sda
oss.space public /disk/sdb
```

In this case, `/data` hosts the name space of all (zip) files stored on this server. Under `/data` is the full directory tree, containing symlinks. The symlinks point to the actual files in `/disk/sda` and `/disk/sdb`. For example:

```
/data/dir1/file1 -> /disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%
```
If `/dev/sdb` fails, one can identify all files under `/data` which are symlinks to `/disk/sdb`. These files will need to be repaired.
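The symlink scan can be sketched as a small shell function (a sketch only; the `/data` and `/disk/sdb` paths are from the example above):

```shell
# find_damaged NAMESPACE FAILED_DISK
# Prints every name-space entry (symlink) under NAMESPACE whose target
# lives on FAILED_DISK.
find_damaged() {
    find "$1" -type l | while read -r f; do
        case "$(readlink "$f")" in
            "$2"/*) echo "$f" ;;
        esac
    done
}
```

For the scenario above, `find_damaged /data /disk/sdb > files_to_repair.txt` produces the repair list.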
If `/data` is lost, we can recover the name space in `/data`, because each file in `/disk/sda` and `/disk/sdb` stores, in its extended attributes, the information about where it belongs in the name space:

```
$ getfattr -n user.XrdFrm.Pfn /disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%
getfattr: Removing leading '/' from absolute path names
# file: disk/sdb/public/DA/CA64DA61306C00000000864f8117DA9200000A6%
user.XrdFrm.Pfn="/data/dir1/file1"
```
By going through all files in `/disk/sda` and `/disk/sdb`, we can reconstruct the directory tree under `/data`.
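That walk can be sketched as follows (a sketch only: it assumes `getfattr` is available and that every data file carries the `user.XrdFrm.Pfn` attribute shown above):

```shell
# rebuild_namespace DISK_DIR
# Recreate the name-space symlinks from each data file's
# user.XrdFrm.Pfn extended attribute, e.g.: rebuild_namespace /disk/sda
rebuild_namespace() {
    find "$1" -type f | while read -r f; do
        pfn=$(getfattr -n user.XrdFrm.Pfn --only-values "$f" 2>/dev/null)
        [ -n "$pfn" ] || continue     # skip files without the attribute
        mkdir -p "$(dirname "$pfn")"
        ln -sf "$f" "$pfn"
    done
}
```

Run it once per `oss.space` file system (`rebuild_namespace /disk/sda; rebuild_namespace /disk/sdb`) to rebuild the whole `/data` tree.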
#### Identify damaged files due to a server failure
A server failure should not automatically trigger the data repair procedure. This is because operation can continue without the failed server, and if the server failure is not due to a disk failure, no data are lost.

In the rare case when the server and its disks are all lost, one will need to check all files on the other servers and see which files are missing a zip file.
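Such a check can be sketched as follows. The assumptions here: each data server stores a file's zip under the same name-space path (as described earlier), so a plain `xrdfs stat` against each server, issued without XrdEC loaded, reveals whether that server holds a zip; the redirector address, the `filelist.txt` input and the n + m threshold of 10 are all hypothetical:

```shell
# count_copies "SERVER LIST" PATH
# Counts how many servers answer an "xrdfs stat" for PATH, i.e. how
# many of the file's n + m zips survive.
count_copies() {
    n=0
    for h in $1; do
        xrdfs "$h" stat "$2" >/dev/null 2>&1 && n=$((n+1))
    done
    echo "$n"
}

# Usage sketch: flag files with fewer than n + m (here 10) zips.
# servers=$(xrdfs my_redirector:1094 locate -h "*" | awk '{print $1}')
# while read -r path; do
#     [ "$(count_copies "$servers" "$path")" -lt 10 ] && echo "$path"
# done < filelist.txt
```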
#### Repair damaged files
With a list of files to be repaired in hand, one can now start the repair procedure. It can be summarized as the following steps:

- identify the files that need to be repaired
- copy each file to a new name, e.g. `myfile.new`
- (optionally) compare the checksums of the old and new files
- delete the old file
- rename the new file to the old name
There are many ways and tools that can be used for this procedure. The following describes how to use `xrootdfs` for repairing.

- Mount the EC storage via `xrootdfs` on a host, e.g. `xrootdfs /data -o nworkers=20 -o rdr=root://my_redirector//data -o direct_io`.
  - Make sure that the XrdEC configuration file from the Enabling EC in xrootd clients section is available. The best way to check is to see whether you get the correct size of a file in the `xrootdfs`-mounted directory tree.
  - If you have 20 data servers, give 20 or a little more to `nworkers`.
- Copy the file. One can use any unix command for copying, for example: `dd if=myfile of=myfile.new bs=1M iflag=direct oflag=direct`. Copying tools that use direct IO for input and output, and a large block size, will perform better.
- (optional) Validate the checksum. Usually existing files already have some kind of checksum calculated and stored. For example, if adler32 is used, one can run `/etc/xrootd/xrdadler32.sh myfile` and `/etc/xrootd/xrdadler32.sh myfile.new` to validate the checksums.
- Use `rm` and `mv` to delete the old file and rename the new file.
The following script automates the above steps. It takes a to-be-repaired file (on xrootdfs) and repairs it in place:
```shell
#!/bin/sh
# repair.sh: repair one damaged file on an xrootdfs mount
file=$1
CKSM="/etc/xrootd/xrdadler32.sh" # this is the checksum script mentioned above
nfile="${file}.newcopy"
dd if=$file of=$nfile bs=1M iflag=direct oflag=direct >/dev/null 2>&1
srccksm=$($CKSM $file)
dstcksm=$($CKSM $nfile)
if [ "$srccksm" = "$dstcksm" ]; then
    echo "[O] $file"
    rm $file
    mv $nfile $file
    exit 0
else
    echo "[X] $file"
    rm $nfile
    exit 1
fi
```
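To run the script over a whole repair list, a small driver can help (a sketch; a `files_to_repair.txt` holding one xrootdfs path per line is an assumption):

```shell
# repair_all LISTFILE REPAIR_CMD
# Runs REPAIR_CMD on every path in LISTFILE and prints the paths that
# failed, so they can be inspected or retried, e.g.:
#   repair_all files_to_repair.txt /path/to/repair.sh > failed.txt
repair_all() {
    while read -r f; do
        "$2" "$f" >/dev/null 2>&1 || echo "$f"
    done < "$1"
}
```

Running files one at a time keeps the load on the cluster predictable; a second pass over `failed.txt` usually clears transient errors.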
### Identify debris left behind

Debris is left behind when a data server is offline while a file is deleted, and the file is copied back later. XrdEC will automatically ignore these "debris" zip files. XrdEC SHOULD print them out for cleaning: more on this later.
### Identify new files for backup

XrdEC SHOULD log all newly created files: more on this later.