MonALISA quick start tutorial

Last updated: February, 2008

Introduction:

This tutorial shows some hands-on examples based on the MonALISA documentation. It is not meant to replace that documentation but to augment it with a step-by-step example of how to use some of the MonALISA functionality. It is by no means complete, nor is it intended to be. It provides hands-on examples and refers to the MonALISA documentation for details.

This tutorial is not perfect and is based on the MonALISA version I was using at the time. Software evolves and I hope to keep this tutorial up to date (and improve it). If you have questions you can contact the MonALISA support mailing list.

While going through this tutorial (or perhaps before it) you might ask yourself: why do you need such an elaborate system for gathering monitoring information?

There are at least four justifications for such a system (there might be more): (1) It can be viewed as a framework in which you publish information. You will notice in the tutorial that most of the work is configuring (and reusing) components rather than building them. Many of the obvious tasks (publication, storage, and presentation of monitoring information) rely on configuration. (2) It is a globally deployed system. If you are deploying a system with many components that interact with (and affect) each other, they can all hook up to MonALISA and create global views from the monitoring data, enabling you to correlate parameters and take appropriate actions. (3) It has proven to be extremely scalable. You can look at the latest statistics at the looking glass page. (4) It is distributed by nature, making it less likely to collapse completely when there is, for example, a power outage at a site. Some of the components will be offline but the system will continue to function.

Questions and comments about this tutorial can be sent to Frank van Lingen. Special thanks go out to Costin Grigoras and the MonALISA support team for helping me to set up and use MonALISA. Parts of this tutorial are copied almost verbatim from email replies from Costin.

For the impatient:

If you do not want to go through the whole tutorial but only want to learn about a specific part of the system, you can skip certain sections. Below is a list of the sections that you need to read for a particular part:

Tutorial Outline

Figure 1 shows the different components we will use during the tutorial. First we set up a MonALISA service and learn how to add a new module (ExTrigger.java) to it that sends emails when a parameter reaches a certain threshold. After that we modify the module's subscription and change it to a parameter we receive from an outside process via ApMon. While publishing, we use the Web Start enabled interactive client to view some of the parameters we have published. The next step is to set up a MonALISA repository, configure it, and learn how to define graphical reports that can be embedded into web pages. We then augment the subscription predicates with parameters that we injected into the publish/subscribe infrastructure via ApMon and the MonALISA service. After that we show how to use SimpleClient.java and augment it with trigger functionality. The two triggers discussed differ in that the first (ExTrigger.java) is local to the MonALISA service and what it receives, while the second is able to subscribe to events from all the farms that are hooked up to the publish/subscribe infrastructure via many MonALISA services. After getting acquainted with adding modules to the MonALISA service and repository (which happen to be triggers but might have had other functionality), we discuss the action/alert framework. This framework enables you to define alerts with thresholds and actions. You will not need to write any Java code in this section, as it is all done through configuration files. The final step is to initiate control actions from a repository to a service through a secure connection.

Figure 1. Overview of the different components we will use in this tutorial.

MonALISA Service

The MonALISA service enables you to load customized plug-ins that act on (service-)local information. It has an in-memory database to store monitoring information for (limited) time series analysis.

  • Note: if you run ML from your home ISP, it is possible that the ISP blocks outgoing traffic to port 25 or that the place where MonALISA is hosted rejects these connections for sending email. You can test whether the OS can send emails by using this command: mail -v -s test subject (email address here)
  • Download the service package
  • Make sure you open the proper ports in the firewall (e.g. in the /etc/sysconfig/iptables file on Linux). Check the ../Service/myFarm/ml.properties file, where you can define the port ranges: 8884 (for ApMon; this is UDP), other ports are TCP.
  • Make sure that no other processes are listening to these ports. E.g. a "netstat -la |grep < port nrs> " to check if nobody is listening on the ports.
  • Untar it and run "install.sh" (and answer the questions correctly)
  • You can now start the MonALISA service by running : ../Service/CMD/ML_SER start
  • Other options for ML_SER: ./ML_SER [ start | stop | restart | version | update ]
  • More information on configuring the service can be found in the installation guide
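The port check in the list above can also be scripted. What follows is a minimal Python sketch, not part of MonALISA: the helper name port_is_free is made up for this illustration. It tries to bind to a port and reports whether something else already holds it. Port 8884/udp is the ApMon port mentioned above; the TCP ports to check depend on your ml.properties.

```python
import socket


def port_is_free(port, kind=socket.SOCK_STREAM, host="0.0.0.0"):
    """Return True if we can bind to the port, i.e. nothing else holds it."""
    sock = socket.socket(socket.AF_INET, kind)
    try:
        sock.bind((host, port))
        return True
    except OSError:
        # Typically "address already in use" (or no permission for the port).
        return False
    finally:
        sock.close()


if __name__ == "__main__":
    # 8884/udp is the ApMon port named in this tutorial.
    print("udp 8884 free:", port_is_free(8884, socket.SOCK_DGRAM))
```

A result of False only tells you that the bind failed; use netstat as described above to find out which process holds the port.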

Once you start the server you can activate the MonALISA interactive client, where your service will show up after a few seconds if it started successfully. The default group is the "test" group (unless you changed it in the config file ../Service/myFarm/ml.properties, attribute "lia.Monitor.group=test"). The name of your farm (this term is used for historical reasons) is specified in the ../Service/CMD/ml_env file (FARM_NAME= < your service name here>). You can check the ../Service/myFarm/ML0.log file to see if the service is running properly. In my particular case MonALISA could not adequately determine the proper IP address, probably because I was running it in a virtual machine and had not opened my firewall properly. Figure 2 shows a screenshot of the interactive client where our service shows up. In its default mode there are two items in the cluster field (this term is also used for historical reasons), Master and MonALISA, which contain some monitoring parameters for the service itself. Starting the service might take some time (a few seconds), but the service is intended to run continuously.

Figure 2. Interactive Client where our service shows up (franks.laptop)
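If your service does not show up in the expected group, it can help to read the configured values back from the properties file. The sketch below is a minimal Python reader for plain key=value lines. The function name read_properties is an illustration only, and the parser deliberately ignores the richer Java properties syntax (line continuations, ':' separators, escapes), so treat its output as a quick check rather than as what the service itself parses.

```python
def read_properties(path):
    """Parse simple key=value lines, skipping blank lines and '#' comments."""
    props = {}
    with open(path, encoding="utf-8") as handle:
        for line in handle:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue
            key, value = line.split("=", 1)
            props[key.strip()] = value.strip()
    return props
```

For example, read_properties("../Service/myFarm/ml.properties").get("lia.Monitor.group") would return the group name if the file uses the plain form shown above.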

ApMon (this assumes the MonALISA service is running)

If you have set up the service to receive ApMon messages (and have configured your firewall to let them through) you can send ApMon messages to this server. ApMon enables you to publish parameters from a variety of languages (C++, Java, Perl, Python) and can be downloaded from this URL. For this tutorial we use the Python ApMon. The "examples" directory contains several examples, one of which we modified below:

import apmon
import time

# Initialize ApMon specifying that it should not send information about the system. 
# Note that in this case the background monitoring process isn't stopped, in case you may
# want later to monitor a job.
# use the ip address rather than the hostname as it is more reliable. You can insert
# multiple ip addresses. The addresses are where the MonALISA services are hosted.
# if your service runs on the same box where you apmon originates from you can use 'localhost'
# instead of the ip or hostname. This is especially useful if you change boxes, 
# as you do not need to update the apmon info every time.
apm = apmon.ApMon(('192.168.245.129',));
apm.setLogLevel("NOTICE");
apm.confCheck = False;
apm.enableBgMonitoring(False);
# this is a maximum rate to prevent (unintentional) over publishing.
apm.setMaxMsgRate(75);

for i in range(1,20000):
    # you can put as many pairs of parameter_name, parameter_value as you want
    # but be careful not to create packets longer than 8K.
    apm.sendParameters("SimpleCluster", "SimpleNode", {'var_i': i, 'ar_i^2': i*i});
    f = 20.0 / i;
    # send in the same cluster and node as last time;
    apm.sendParams({'var_f': f, '5_times_f': 5 * f, 'i+f': i + f});
    #print "simple_send-ing for i=",i
    time.sleep(0.1);
Save the example (e.g. as simple_sample.py) and execute it, preferably on a different machine than the one hosting the MonALISA service. Make sure you can import the ApMon Python library available on the MonALISA web page, and then run "python simple_sample.py". It will print some information on the screen, but it will also send monitoring parameters to our MonALISA service. As the interactive client subscribes to our parameters, they are displayed there too (see Figure 3). As the parameters are published relative to time, they can be displayed as time series.

Figure 3. Parameters published by our ApMon application viewed within the interactive client.

Once the ApMon messages are consumed by the service they also become available to other services that subscribe to particular events being generated in the system (see the repository section).

As you see, it is possible to send multiple monitoring records in one ApMon message. However, for practical reasons the length of an ApMon message should be less than 1500 B. There are local network configurations which may cut long UDP messages. ApMon takes care automatically to keep the UDP messages less than the "standard" MTU. If the user wants to send a large number of monitoring records in one message, and the XDR encoding for it exceeds 1500B, ApMon generates two or more UDP messages automatically.
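To make the splitting idea concrete, here is a minimal Python sketch of greedily packing parameters into chunks that stay under a byte budget. It is not ApMon's implementation: the function name split_into_datagrams, the default budget of 1400 bytes, and the per-parameter size estimate are all assumptions made for this illustration, and the real size depends on ApMon's XDR encoding.

```python
def split_into_datagrams(params, max_bytes=1400, size_of=None):
    """Greedily pack name/value pairs into chunks whose estimated size fits max_bytes."""
    if size_of is None:
        # Assumption: rough per-parameter cost, NOT ApMon's real XDR encoding.
        def size_of(name, value):
            return len(name.encode("utf-8")) + 16

    chunks, current, used = [], {}, 0
    for name, value in params.items():
        cost = size_of(name, value)
        # Start a new chunk when the next parameter would exceed the budget.
        if current and used + cost > max_bytes:
            chunks.append(current)
            current, used = {}, 0
        current[name] = value
        used += cost
    if current:
        chunks.append(current)
    return chunks
```

Each returned chunk is self-contained, so losing one of them would not affect the parameters carried by the others. A single parameter larger than the budget still ends up in a chunk of its own.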

If ApMon splits a monitoring message into several UDP messages, these are self-contained. If one UDP packet is lost, it does not affect the information in the others. Each UDP message from an ApMon instance has a sequence number, so if a UDP message is lost we at least know that there are problems with that instance and site. ApMon has built-in functionality to avoid flooding the network with UDP messages in case it is used by mistake in an infinite loop: it limits the rate at which UDP messages are sent. This works fine if you use one instance of ApMon. Of course, if someone starts a lot of instances at the same time, this can still create a large number of messages at once.
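The rate limit can be pictured with a small sketch. The class below is a hypothetical fixed-window limiter written for this tutorial, not ApMon's actual algorithm; the class name, the one-second window, and the injectable clock are all assumptions. It shows why a runaway loop in one process is throttled while several independent instances are not.

```python
import time


class MaxRateLimiter:
    """Allow at most max_per_second sends per fixed one-second window."""

    def __init__(self, max_per_second, clock=time.monotonic):
        self.max_per_second = max_per_second
        self.clock = clock
        self.window_start = clock()
        self.sent_in_window = 0

    def allow(self):
        now = self.clock()
        # Open a new window once a full second has passed.
        if now - self.window_start >= 1.0:
            self.window_start = now
            self.sent_in_window = 0
        if self.sent_in_window < self.max_per_second:
            self.sent_in_window += 1
            return True
        return False  # caller should drop or delay the message
```

Each limiter only counts its own sends, which is why starting many instances at once can still produce a burst of messages.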

Trigger module

We have now successfully set up the MonALISA service and used ApMon. Now we can write additional modules. Modules can be categorized into three types:

  • Monitoring modules (processing and publishing results)
  • Trigger/Alarm modules (subscribing and processing results)
  • Agent modules (complex interaction between agents and/or publishing/subscribing to results)

A detailed description of the purpose of these three types of modules can be found in the User guide.

You can find examples of modules in the ../Service/usr_code directory. It contains a set of modules that you can load dynamically into your MonALISA service by configuring the properties files (more on this later). The section below describes a trigger/alarm module example: how to write or edit it, compile it, and configure it.

  • Look at an available example in: Service/usr_code/FilterExamples/ExTrigger. This example subscribes to Load5 events and sends an email if the value reaches a certain threshold.
  • Change the following items in the ExTrigger.java class (before doing that, make a backup of the original, e.g. ExTrigger.java.original):
    • Change RCTP to your own email address
    • Change MAIL_DELAY_NOTIF to once every 5 minutes: private final static long MAIL_DELAY_NOTIF = 1 * 60 * 5 * 1000;
    • Change MAX_LOAD to .2 (so to quickly trigger an alert): public final static double MAX_LOAD = .2;
The predicate specified in the constructor says that this module subscribes to all Load5 events in the MonaLisa cluster field (we do not change it for the moment) of the farms associated with this service (not of all farms in the grid). The notifyResult method is overridden; in it the module analyzes the events it is subscribed to, and if a value reaches the threshold it triggers the alarm. Note that the number of emails is limited (one per 5 minutes, as set above). Once we are satisfied with our trigger we can compile it. Before compilation, perform these three steps:
  • Go to ../Service/CMD dir.
  • source ml_env
  • Include the FarmMonitor.jar in your class path: CLASSPATH=$MonaLisa_HOME/Service/lib/FarmMonitor.jar
  • Then, in the ExTrigger dir, do: javac ExTrigger.java (or execute the comp file if available)
  • Edit the following attributes in the ml.properties file (this adds the newly compiled class to the MonALISA service):
    • lia.Monitor.CLASSURLs=file:${MonaLisa_HOME}/Service/usr_code/PBS/, file:${MonaLisa_HOME}/Service/usr_code/FilterExamples/ExTrigger/
    • lia.Monitor.ExternalFilters=ExTrigger
  • now do a: ML_SER restart
Check the log file ML0.log to see if the ExTrigger module is loaded (grep for ExTrigger). If that is the case, check ML.log to see if events are registered. As we set the threshold quite low, it should register some events (and mention this in the ML.log file). Also check your inbox (or your junk folder, in case the message is marked as junk mail) to see if you get some emails (the log should mention when it sends an email). Note: if you run ML from your home ISP, it is possible that the ISP blocks outgoing traffic to port 25 or that the place where MonALISA is hosted rejects these connections for sending email.
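The behavior configured above (a threshold plus a minimum delay between emails) can be sketched in a few lines of Python. This is only an illustration of the logic, not a translation of ExTrigger.java: the class and method names are made up, and whether the real module compares with "greater than" or "greater than or equal" is an assumption here.

```python
class ThresholdTrigger:
    """Decide when to mail: value above the threshold, at most one mail per delay."""

    def __init__(self, max_load=0.2, mail_delay_s=5 * 60):
        self.max_load = max_load          # mirrors MAX_LOAD = .2
        self.mail_delay_s = mail_delay_s  # mirrors MAIL_DELAY_NOTIF (5 minutes)
        self.last_mail = None

    def notify(self, value, now):
        """Return True when an email should be sent for this value."""
        if value <= self.max_load:
            return False
        if self.last_mail is not None and now - self.last_mail < self.mail_delay_s:
            return False  # still inside the quiet period after the last mail
        self.last_mail = now
        return True
```

With these settings a value that stays above 0.2 produces one notification every five minutes, which is what you should observe in ML.log.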

Next we will change the parameter we listen to, to one we publish with our ApMon application of the previous section. Change the following code in the ExTrigger.java:

monPreds = new monPredicate[]{new monPredicate( "*", "SimpleCluster", "*", -1, -1, new String[]{"var_i"}, null )};
We now subscribe to "var_i".
  • Compile and restart the server "ML_SER restart"
  • Also start the ApMon application outlined in the previous section, which publishes the "var_i" parameter (preferably on another node, but it also works on the same node).
  • Check ML0.log if the ExTrigger started
  • Check ML.log to see if events are published.
As "var_i" only increases it will always reach our low threshold and thus send emails once every 5 minutes.

MonALISA Repository

The MonALISA repository can be viewed as a special kind of MonALISA service in that it: can subscribe to events published by farms associated to many MonALISA services, and it can store the time series of parameters it is subscribed to, and display it through web pages (all of this configurable).

  • Download the repository
  • Edit the firewall settings for ports 5544 (PostgreSQL), 8080 and 8006 (Tomcat). You can change these ports if necessary in the conf/ dir. Note that 5544 and 8006 are only used on localhost and are not accessed from the outside.
  • Do a netstat -la|grep < port nrs > to check if the ports are not already in use.
  • Execute ./install.sh
  • I had to change the tomcat port in the ../conf/web_server_config.xml file and also to update the env.MONALISA_WS (put in the proper hostname and port) and env.JAVA_HOME parameter
  • More information can be found in the User guide
  • Do ./start.sh
  • Go to: http://<your host name>:<your port>/ , and you will see something similar to Figure 4.
  • NOTE: you can stop MonALISA using the ./stop.sh script. However, it has been observed that MonALISA does not always shut down, or that it restarts itself because the '.../script/verify.sh' script is in the crontab. If removing this script from the crontab fails (or it is not listed), the server automatically restarts itself. To really force the server to stop when ./stop.sh does not work, rename ../script/verify.sh, run ./stop.sh, and if that does not help use kill -9 to finish it. Once you restart the server, put verify.sh back.
When you install the repository and visit the web pages you see an administration link (lower left). You can set the password for that in: ../tomcat/conf/tomcat-users.xml. The administration functionality enables you to select the sites from which you want to receive monitoring information based on the groups you defined in the properties files. It also enables you to assign colour schemes to the sites.

Figure 4. MonALISA Web Repository

The repository subscribes by default to several events from the test group (see interactive client) so you get a look and feel of the functionality. Click for example on Statistics-->Farm to see a table view (Figure 5):

Figure 5. MonALISA Web Repository (table view)

Basic configuration of the repository is described in section 2 and 3 of the user guide. To create a chart plot or a statistics table point your browser to: http://<your host name>:<your port nr>/servlet?page=name where:

  • servlet - can be "display", "stats" or "genimage"
  • name - the name.properties file containing the specific properties of the desired chart/table, without the ".properties" extension. These configuration files are found in ../tomcat/webapps/ROOT/WEB-INF/conf/ directory.
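As a small illustration of the URL scheme just described, the Python sketch below assembles such a URL from its parts. The function name page_url is hypothetical; the three servlet names and the "page" query parameter come from the description above, and stripping a trailing ".properties" simply enforces the rule that the extension is left out.

```python
from urllib.parse import urlencode

SERVLETS = ("display", "stats", "genimage")


def page_url(host, port, servlet, name):
    """Build http://<host>:<port>/<servlet>?page=<name> for a chart or table."""
    if servlet not in SERVLETS:
        raise ValueError("unknown servlet: %r" % servlet)
    # The page name is the properties file name without its extension.
    if name.endswith(".properties"):
        name = name[: -len(".properties")]
    return "http://%s:%d/%s?%s" % (host, port, servlet, urlencode({"page": name}))
```

For instance, page_url("localhost", 8080, "display", "sample_realtime") gives the address of the page configured by sample_realtime.properties on a repository running locally on port 8080.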
The "display" servlet can be used to produce realtime plots such as bar, spider web or pie charts or history charts with lines and shapes or overlapping areas. The realtime plots use the latest available information from the Cache object. This object holds the last known values just for a limited amount of time. The next steps involve manipulating some of the web pages:
  • Edit the file: ../tomcat/webapps/ROOT/js/menu.js
  • Comment out:
     d.add(113,1,'Farms Usage','display?page=spider_usage');
  • Reload the front web page: http://<your hostname>:<your port>/ and the entry in the tree view has disappeared.
  • Now replace the above entry with
     d.add(113,1,'Load Master Nodes','display?page=sample_realtime');
  • You would see something like Figure 6.

Figure 6. MonALISA Web Repository (inserted new menu item)

Subscribing and Visualizing Events

To store and visualize monitoring information the repository needs to be subscribed to these events. Subscription to events is defined in ../JStoreClient/conf/App.properties. In it you define the predicates for the events the repository will receive, and also which of them will be stored. Once you have defined the subscriptions, you can configure the plot properties files to display the data in a web page.

sample_realtime.properties is a properties file in ../tomcat/webapps/ROOT/WEB-INF/conf that displays Load5 information from the Master clusters. Note that in the "Embedded/conf/App.properties" file we are subscribed through a predicate to Master events (*/Master/*)

Now comment out in the ../JStoreClient/conf/App.properties:

  • lia.Monitor.JiniClient.Store.global_params=Load5,TotalIO_Rate_IN,TotalIO_Rate_OUT,NoCPUs
  • lia.Monitor.JiniClient.Store.predicates=*/WAN/*/-1/-1/%_IN|%_OUT, */Master/*/-1/-1/Load5|%_IN|%_OUT,*/MonaLisa/*-1/-1/*
These lines specify to what events we subscribe, and if we are not subscribed to them the views (web pages) that use these events will become empty.
  • Restart the server and wait 5-10 minutes (depending on the configured refresh rate).
  • After a while you see something like Figure 7.

Figure 7. MonALISA Web Repository (view after cancelled subscriptions).

  • Start the monalisa service again with our trigger (not the repository, but keep the repository running)
  • Activate our apmon application to publish parameters (see our apmon example) including the var_i parameter.
  • Edit the: JStoreClient/conf/App.properties file and set the following parameter (comment the older value out): lia.Monitor.JiniClient.Store.predicates=*/WAN/*/-1/-1/%_IN|%_OUT, */Master/*/-1/-1/Load5|%_IN|%_OUT, */MonaLisa/*-1/-1/*,*/SimpleCluster/*/-1/-1/var_i . This says that we subscribe to parameter "var_i" from clusters with name "SimpleCluster"
  • Copy sample_realtime.properties to var_i.properties and replace "Master" with "SimpleCluster" and "Load5" with "var_i" (in the ../tomcat/webapps/ROOT/WEB-INF/conf dir).
  • Replace the menu item we inserted in ../tomcat/webapps/ROOT/js/menu.js with: d.add(113,1,'var_i values','display?page=var_i');
  • Go to the main page and click on "var_i values" where you see something like Figure 8:

Figure 8. MonALISA Web Repository with our own ApMon based parameter and web page.
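To see how a predicate string such as */SimpleCluster/*/-1/-1/var_i breaks down, consider the Python sketch below. It is an illustration only and not MonALISA's parser. It assumes the field order farm/cluster/node/tmin/tmax/parameters suggested by the monPredicate constructor shown earlier, that "|" separates alternative parameter names, and that "%" acts as a wildcard inside a parameter name. The function names are hypothetical.

```python
from fnmatch import fnmatchcase


def parse_predicate(text):
    """Split one predicate string into its fields (short forms like */Master/* allowed)."""
    parts = text.strip().split("/")
    parts += ["*"] * (3 - len(parts))
    return {
        "farm": parts[0],
        "cluster": parts[1],
        "node": parts[2],
        "t_min": int(parts[3]) if len(parts) > 3 else -1,
        "t_max": int(parts[4]) if len(parts) > 4 else -1,
        "params": parts[5].split("|") if len(parts) > 5 else ["*"],
    }


def matches(pred, farm, cluster, node, param):
    """Check whether one published value would be selected by the predicate."""
    def ok(pattern, value):
        # Assumption: '%' is treated like the '*' wildcard.
        return fnmatchcase(value, pattern.replace("%", "*"))

    return (ok(pred["farm"], farm) and ok(pred["cluster"], cluster)
            and ok(pred["node"], node)
            and any(ok(p, param) for p in pred["params"]))
```

Under these assumptions the var_i predicate selects var_i from any farm and node in SimpleCluster, but not var_f, which matches what you see in the repository.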

Notice that the value keeps rising, this has to do with the apmon application that increases the value of var_i.

We just saw how to visualize real-time data. The next steps show how to visualize historic data, which is retrieved from the repository in which the real-time data is stored. There is currently a bug in the plot library we use, so you first need to edit tomcat/webapps/ROOT/WEB-INF/conf/global.properties and change:

  • timezone=local
  • timeaxis=Anything you like displayed, for example 'local time'
  • Copy tomcat/webapps/ROOT/WEB-INF/conf/hist_link2.properties to tomcat/webapps/ROOT/WEB-INF/conf/var_i_hist.properties
  • edit the following in the var_i_hist.properties file:
    • Clusters=SimpleCluster
    • Functions=var_i
    • ylabel=var_i
    • title="var_i " on SimpleCluster Nodes
  • Now add: d.add(114,1,'var_i history','display?page=var_i_hist'); to ../tomcat/webapps/ROOT/js/menu.js
  • Make sure our ApMon program and the MonALISA service are running.
  • Go to the repository web page and you see the history item is added. Click on it and you see something like figure 9.

Figure 9. Historical Data visualized.

The above showed how we can subscribe to parameters. Although the application is called 'MonALISA repository', you can opt not to store the data in the associated database by configuring the ../JStoreClient/conf/App.properties parameter:

lia.Monitor.JiniClient.Store.dontStore

You can subscribe to a larger set of data and discard some of it with the "dontStore" parameter. You will still be able to see the last values in the cache ($C constructor), just as you can see it in the http://..........:..../dump_cache.jsp. As storages, the repository has:

  • 1) An in-memory cache of last values for each series, controlled by this parameter in App.properties: lia.web.Cache.RecentData = 120 (expiration in minutes)
  • 2) An in-memory history buffer, with a dynamic length depending on how much memory is allocated to JVM
  • 3) Real database on disk for persistent storage.
All structures are shared with the MonALISA service, but usually you don't have 3. enabled in a service, just to make it as light as possible.

1. will see everything you subscribe to, while 2. and 3. will receive everything but what is filtered out by the "dontStore" parameter. Actions use 1, so you can define actions but not store the data used to take the decisions.
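The first storage layer, a cache of last values that expire after a configurable number of minutes, can be pictured with the Python sketch below. It only illustrates the idea behind lia.web.Cache.RecentData; the class name, its methods, and the injectable clock are assumptions for this example and say nothing about how the repository implements its cache.

```python
import time


class RecentDataCache:
    """Keep the last value per series and forget it after expiry_minutes."""

    def __init__(self, expiry_minutes=120, clock=time.time):
        self.expiry_s = expiry_minutes * 60
        self.clock = clock
        self._values = {}  # series key -> (timestamp, value)

    def put(self, key, value):
        self._values[key] = (self.clock(), value)

    def get(self, key):
        entry = self._values.get(key)
        if entry is None:
            return None
        stamp, value = entry
        if self.clock() - stamp > self.expiry_s:
            del self._values[key]  # value is too old to count as "recent"
            return None
        return value
```

Because such a cache holds only the most recent value of each series, it stays small even when you subscribe to far more data than you store.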

The properties files to configure the plots are very flexible. Below are some more examples of this.

The first example below shows how we can select nodes by using queries on the repository. Nodes are stored as strings in the mi_key column of the monitor_ids table and have the format <farm>/<cluster>/<node>/<parameter> (the same format as you see in the MonALISA interactive client). We assign to the nodes a query which says: select all distinct entries from monitor_ids that are from farm PhEDEx, from any cluster, whose node contains the substring 'RAL' and whose parameter is 'dRSS'. Split these entries on the slash (which separates the farm/cluster/node/parameter hierarchy) and take the third item (which is the node). Make sure you set 'Wildcards=N' when doing this.

page=hist

#Farms option is already defined in global.properties
Farms=PhEDEx
Clusters=PhEDEx_RAL_Delta
Nodes=$QSELECT distinct split_part(mi_key,'/',3) FROM monitor_ids WHERE mi_key LIKE 'PhEDEx/%/%RAL%/dRSS';
Functions=dRSS
# possible values for the "Wildcards" option are :
# F : Farms
# C : Clusters
# N : Nodes
# f : functions
Wildcards=N
FIXME: more example
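To make the query easier to read, the Python sketch below mimics what the two SQL pieces do: LIKE with "%" and "_" wildcards, and split_part with a 1-based index. The helper names and the sample keys in the usage note are invented for this illustration; the real selection is done by the database.

```python
import re


def split_part(text, delimiter, index):
    """Return the index-th field (1-based), or '' if there is no such field."""
    parts = text.split(delimiter)
    return parts[index - 1] if 1 <= index <= len(parts) else ""


def like(text, pattern):
    """SQL LIKE: '%' matches any run of characters, '_' exactly one character."""
    regex = "".join(
        ".*" if ch == "%" else "." if ch == "_" else re.escape(ch)
        for ch in pattern
    )
    return re.fullmatch(regex, text, flags=re.DOTALL) is not None


def select_nodes(keys, pattern="PhEDEx/%/%RAL%/dRSS"):
    """Distinct third fields of all keys that match the LIKE pattern."""
    return sorted({split_part(key, "/", 3) for key in keys if like(key, pattern)})
```

Given hypothetical keys such as "PhEDEx/PhEDEx_RAL_Delta/T1_RAL_Buffer/dRSS" and "PhEDEx/PhEDEx_CERN_Delta/T0_CERN_Export/dRSS", select_nodes keeps only the first and returns its node part, T1_RAL_Buffer.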

SimpleClient.java

The repository might not be suitable for every occasion. Sometimes you just want to collect data, take some action, and throw the data away. Within the repository package you can use the ../Embedded/SimpleClient.java application to achieve this goal. Provided you have configured the environment variables correctly (../conf/eng.JAVAHOME, etc...) you can compile it (./compile.sh) and then run it (./run.sh). It runs in the background, but you can remove the & in the run.sh script to change that.

The application creates a log file, log.out, to which it writes the parameters it is subscribed to. Notice that this application, unlike our ExTrigger example at the beginning of the tutorial, can subscribe to events from farms that injected their monitoring information through other services.

Do the following:

  • Edit the ../Embedded/conf/App.properties file
  • Set: lia.Monitor.Store.TransparentStoreFast.writer_0.writemode=3 (in memory writing)
  • Set: lia.Monitor.group = test (as this is the group we are in with our MonALISA service)
  • Set: lia.Monitor.JiniClient.Store.predicates=*/SimpleCluster/*. Subscribe to all events from farms that have a "SimpleCluster"
  • Start the SimpleClient again (make sure that our ApMon and MonALISA service are running), and watch the log.out file.
You will notice that nothing is written there. This is because SimpleClient.java specifies which parameters the application wants to handle (Load5,Load_15,...) and that list does not contain a parameter from SimpleCluster. After a while another file also appears: JStore*log. If you look in this file you see that the SimpleCluster entries are indeed being created.

TriggerClient.java

The next step will be to modify the SimpleClient.java code to write "var_i" events to the log.out file, and then to modify it further to act as a trigger.

  • Copy the SimpleClient.java to TriggerClient.java and run.sh to runTrigger.sh.
  • Rename SimpleClient to TriggerClient in runTrigger.sh.
  • Rename the class in TriggerClient.java from SimpleClient to TriggerClient.
  • Remove the predicates and replace it with:
     vWatchFor.add(new monPredicate("*", "*", "*", -1, -1, new String[]{"var_i"}, null));
  • Execute "compile.sh" and after that "runTrigger.sh". Make sure that the ApMon client and its associated MonALISA service are running.
You see that the TriggerClient.java application now handles our var_i parameters. Notice that the other SimpleCluster parameters are still logged in the JStore*log file. The next step will be to add email functionality to our TriggerClient. We will not explain the steps to do that but simply point to the modified trigger file. You can see which parts are modified by looking at the sections between START ADDED CODE and END ADDED CODE. Notice that this code is very similar to some of the ExTrigger.java code.

Compile the modified trigger class and run it. Make sure the ApMon client and its associated MonALISA service are running.

While this is running you notice you get emails from two sources. (1) The ExTrigger module that still reports information and (2) The new TriggerClient. The first module has a local (farm) view and only deals with events generated within its associated MonALISA context, while the second module has a global view in that it can subscribe to events from many MonALISA services.

Action Framework

This section discusses the MonALISA action framework. With this framework you can configure alerts without any programming, unlike in the previous examples.

  • Add the following 2 lines to our ApMon example code from one of the previous sections:
    • apm.sendParameters("SimpleCluster", "ApplicationX", {'param1': i%300});
    • apm.sendParameters("SimpleCluster", "ApplicationY", {'param1': i%300});
    These lines send a param1 parameter to our deployed MonALISA service. Its value resets every 300 iterations, which is every 300 seconds if you publish once per second (with the time.sleep(0.1) used in the ApMon example above it resets roughly every 30 seconds).
  • Start our modified ApMon application.
  • Verify in the MonALISA interactive client if our service is indeed receiving the new values (figure 10).

Figure 10. Interactive Client with the param1 parameter with the intentionally up and down behavior.
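The up and down behavior of param1 is what makes it a convenient test signal for alerts: a threshold placed somewhere inside its range is crossed in both directions on every cycle. The Python sketch below shows this with a hypothetical threshold of 150. The threshold value, the function name, and the "error"/"success" labels are assumptions for this illustration, not values taken from an action configuration file.

```python
def state_changes(values, threshold):
    """Yield (index, state) each time the value crosses the threshold."""
    state = None
    for index, value in enumerate(values):
        new_state = "error" if value > threshold else "success"
        if state is not None and new_state != state:
            yield index, new_state
        state = new_state


# The same sawtooth the ApMon example publishes: i % 300 for i = 1, 2, ...
sawtooth = [i % 300 for i in range(1, 900)]
changes = list(state_changes(sawtooth, threshold=150))
print(changes)
```

Under these assumptions the signal alternates between the two states once per half cycle, so an alert bound to it should fire and clear repeatedly.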

  • Update the lia.Monitor.JiniClient.Store.predicates parameter in JStoreClient/conf/App.properties of our repository deployment. Add: */SimpleCluster/*/-1/-1/param1. This says that we want to receive param1 parameters from any farm that has a SimpleCluster cluster, with any node value.
  • Add the file param1_hist.properties to the /tomcat/webapps/ROOT/WEB-INF/conf directory (we use this file to verify if the repository receives the parameter)
  • Verify if our repository is indeed receiving the new values, by going to this page (replace the hostname and port with the one where you deployed it): http://frank.ultralight.org:8083/display?page=param1_hist (see figure 11).

Figure 11. MonALISA repository page for param1 history plots.

  • Copy the file addDate.sh to a place where it can be executed by the repository. Note: you might want to do a find -name "dateLog.txt" to see where the log file is created. In this particular case it ended up in: MLrepository/tomcat/bin/dateLog.txt
  • Create an action directory (e.g in the toplevel MonALISA dir) and copy this file to this directory.
  • The file contains some documentation about the various parameters. Read through it, and edit at least these parameters:
        # replace it with your MonALISA service name
        series.0=frank.ultralight.org_service
        # replace it with your email address (you can add more addresses by separating them
        # with a comma)
        action.0.to=fvlingen@caltech.edu
        # pick a place where the log can be written
        action.1.file=/home/monalisa/logs/param1.log
        # point it towards the addDate.sh script
        action.2.execute=/home/monalisa/commands/addDate.sh
    
  • Add a line to: JStoreClient/conf/App.properties to specify the action directory:
     lia.util.actions.base_folder=/home/monalisa/MLrepository/actions
  • If you want to see what is happening with the actions (in case of problems) you can enable the logging for this part in the App.properties file:
     
                  lia.util.actions.Action.level=FINEST
    	      lia.util.actions.ActionsManager.level=FINEST
    	
    When you have logging activated and activate the action manager you see something like this in your log file:
            Mar 29, 2007 4:22:34 PM lia.util.actions.ActionsManager reload
    	INFO: Observing: 2 total files, 2 periodic actions and 0 event-based actions
    		
  • NOTE I : Make sure that when you add the action config parameters to the App.properties file that there is an action in that directory otherwise the action manager is not activated.
  • NOTE II:If you add or remove a .properties file from the actions folder you have to at least touch JStoreClient/conf/App.properties. This will trigger a reload of the action framework.
  • NOTE III: Make sure you can send emails from the machine the repository is running on. In the global App.properties file you can use the following options to tune the sending of the emails:
    lia.util.mail.CONNECT_TIMEOUT
    	integer, smtp connection timeout, in seconds, default 10
    
    lia.util.mail.SOCK_TIMEOUT
    	integer, smtp chat timeout, in seconds, default 40
    
    lia.util.mail.USE_LOCAL_MAIL
    	boolean, whether or not to use the "mail" command, default false
    
    lia.util.mail.LOCAL_MAIL_COMMAND
    	string, if the above option is true, what is the command to use for sending
    
    lia.util.mail.MailServer
    	string, IP or name of the server through which the emails are sent
    	default: use directly the MX of the destination domain
    
    lia.util.mail.MailServerPort
    	int, TCP port of the SMTP server, default 25
    
    To see more debugging info if you have problems you can enable more verbose logging of this class with:
    lia.util.mail.DirectMailSender=FINE|FINER|FINEST|ALL (default is INFO in the 
    standard configuration)
    
  • We are ready for alerts! The thresholds for the alerts are set very low, so we should get some alerts every 2-3 minutes. Let it run for a while to see what happens. Also monitor the JStoreClient/conf/log.out file, which is the file to which action logging information is written.
After running our ApMon client the log file contains several entries:
2007-03-29 16:07:02: ApplicationY at frank.ultralight.org_service is error message through log
2007-03-29 16:07:03: ApplicationX at frank.ultralight.org_service is error message through log
2007-03-29 16:13:20: ApplicationY at frank.ultralight.org_service is success message through log
2007-03-29 16:13:20: ApplicationX at frank.ultralight.org_service is success message through log
2007-03-29 16:17:00: ApplicationY at frank.ultralight.org_service is error message through log
2007-03-29 16:17:01: ApplicationX at frank.ultralight.org_service is error message through log
2007-03-29 16:22:43: ApplicationY at frank.ultralight.org_service is success message through log
2007-03-29 16:22:43: ApplicationX at frank.ultralight.org_service is success message through log
2007-03-29 16:28:43: ApplicationY at frank.ultralight.org_service is error message through log
2007-03-29 16:28:44: ApplicationX at frank.ultralight.org_service is error message through log
2007-03-29 16:33:40: ApplicationY at frank.ultralight.org_service is success message through log
2007-03-29 16:33:40: ApplicationX at frank.ultralight.org_service is success message through log
2007-03-29 16:38:41: ApplicationY at frank.ultralight.org_service is error message through log
The MLrepository/tomcat/bin/dateLog.txt file (which is written to by the command that is executed when an alert fires) also contains some entries:
Thu Mar 29 16:07:02 PDT 2007
Thu Mar 29 16:07:03 PDT 2007
Thu Mar 29 16:17:00 PDT 2007
Thu Mar 29 16:17:01 PDT 2007
Thu Mar 29 16:28:43 PDT 2007
Thu Mar 29 16:28:44 PDT 2007
Thu Mar 29 16:38:41 PDT 2007
Thu Mar 29 16:38:42 PDT 2007
Thu Mar 29 16:47:04 PDT 2007
Thu Mar 29 16:47:05 PDT 2007
Thu Mar 29 16:56:01 PDT 2007
And we also received several email messages regarding the alerts. Note that since we configured it to send messages on success and failure, getting an email does not mean something is wrong.

Monalisa alerts
Figure 12. Email alerts sent by our alert.

If you examine the action properties file that you put in the action directory, you find three parameters that describe when alerts are triggered:

period=60
threshold.success=2
threshold.error=3

period=60 means that the state is evaluated every minute (60 seconds). Taking the success threshold as an example: threshold.success=2 means the state has to remain OK for two consecutive evaluations (2 minutes). The first time this happens, an email is sent. The default values for the thresholds (used when you don't explicitly declare them in the configuration file) are:

threshold.success=3
threshold.error=3
threshold.flip_flop=0
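
The threshold behaviour described above can be sketched in a few lines of Python. This is only an illustration, not MonALISA code: the function name and the exact notification policy (notify the first time a state has held for the required number of consecutive evaluations, then stay quiet until the opposite state is confirmed) are assumptions based on the description in this tutorial, and threshold.flip_flop is not modelled.

```python
def notifications(evaluations, threshold_success=3, threshold_error=3):
    """Return (index, state) pairs at which a notification would be sent.

    evaluations: one boolean per period (True = rule evaluated OK).
    Assumption: a notification is sent the first time a state has held
    for the required number of consecutive evaluations, and not again
    until the other state has been confirmed in the same way.
    """
    sent = []
    confirmed = None      # last state a notification was sent for
    streak_state = None   # state of the current run of evaluations
    streak = 0            # length of the current run
    for i, ok in enumerate(evaluations):
        state = "success" if ok else "error"
        if state == streak_state:
            streak += 1
        else:
            streak_state, streak = state, 1
        needed = threshold_success if ok else threshold_error
        if streak >= needed and state != confirmed:
            confirmed = state
            sent.append((i, state))
    return sent


# Values from the example action file: threshold.success=2, threshold.error=3
history = [False] * 3 + [True] * 2 + [False] * 2 + [True] * 2
print(notifications(history, threshold_success=2, threshold_error=3))
# → [(2, 'error'), (4, 'success')]
```

With period=60 each list element stands for one minute, so in this sketch the error is reported after three bad minutes and the recovery after two good ones. The short error run that follows (only two evaluations) never reaches threshold.error and produces no message.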

You can trigger the state evaluation either periodically or whenever a new value of a series arrives. Periodic evaluation works well when you have a steady flow of monitoring information, because you usually evaluate the cached value (the last received value). You can put time constraints in the predicate to guard against situations where you actually have no data, although in some cases this is not what you want. For example you can use:

$CvFarm/Cluster/Node/-60000/-1/Parameter;
to limit the time to the last minute and set
ignore_missing_data=true

to disable the actions when there is no data. Usually one combines this with another action that sends an alert notifying that the service that sends monitoring information is dead :)

Now, if you really have data that has no periodicity you can use instead of
period=60 
something like:
trigger=Farm/Cluster/Node/Parameter

This way the evaluation is done only when a new value is received for that series. The trigger is unique: there is one trigger per configuration file, and you cannot use "variables" to define it.

Monalisa actions
Figure 13. Visualization of the workings of threshold parameters (thanks to Costin Grigoras for providing this plot)

Setting an alarm/action when no monitoring data is received

It is very useful to get alarms when no monitoring data is being received. The following example explains how to do this. In the series parameter you specify the parameters; all other values in the properties file are very similar to those of the previous example, except for the rule and the ignore_missing_data values. You do not want to ignore missing data, as this alert needs to be triggered precisely when something is missing. In the rule you check whether any value was published for the parameter you are monitoring between now and a time in the past (let's say 35 minutes in the past). So the parameter we want to monitor can look something like this:

series.0=$QSELECT distinct mi_key FROM monitor_ids WHERE mi_key LIKE 'PhEDEx/SensorStatus/%/SensorStatus';

And the rule looks like this (where $Ct#0 describes the parameter from series.0):

rule=$Ezero_if_null($Ct#0;)>now()-300000

now() denotes the current time, while 300000 is how far back, in milliseconds, we want to look for non-zero values of our parameter (300000 ms = 5 minutes). If we combine this with the threshold parameter and the period parameter, we can calculate how long it takes to get an alert when nothing is received. E.g. if we use the rule above (check whether we received something in the last 5 minutes), the period is 60 (1 minute) and the threshold is 10, then we get an alarm after 5 + 1*10 = 15 minutes. You can find an example properties file here with slightly different numbers and some comments.
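
The arithmetic above can be checked with a small Python sketch. It is not part of MonALISA: both function names are hypothetical, and the first function assumes that $Ct#0 yields the timestamp (in milliseconds) of the last received value, or nothing when no value was ever received.

```python
def rule_holds(last_seen_ms, now_ms, window_ms=300000):
    """Mimic the rule: zero_if_null(last timestamp) > now() - window.

    last_seen_ms is None when no value was ever received (assumption).
    """
    last = 0 if last_seen_ms is None else last_seen_ms
    return last > now_ms - window_ms


def minutes_until_alert(window_ms, period_s, threshold):
    """Look-back window plus `threshold` consecutive failed evaluations."""
    return window_ms / 60000 + (period_s / 60) * threshold


now = 1_000_000                      # arbitrary "current time" in ms
print(rule_holds(900_000, now))      # value seen 100 s ago → True
print(rule_holds(None, now))         # never seen anything  → False
print(minutes_until_alert(300000, 60, 10))  # → 15.0
```

Changing any of the three numbers (the window in the rule, period, or the error threshold) shifts the delay accordingly, which is handy when tuning how quickly a silent service should be reported.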

Command and Control

The previous sections showed how to publish and subscribe to messages and initiate simple alerts (e.g. emails) in a distributed monitoring system. This section will describe how repositories that are subscribed to messages can initiate actions on the services in the system in a secure way.

FIXME: Expand this section

Retrieving data directly from the MonALISA backend (database)

This section is intended for experts only (be warned ;) ). It documents how you can retrieve data directly from the backend and query the backend.

The default backend for MonALISA is Postgres. You can use the Postgres command line client to access the database with the default credentials as configured in the MonALISA config files: psql --port=5544 --dbname=mon_data --username=mon_user. If you want to find out which databases (and their owners) exist on this server you can do: psql --port=5544 -l. To list the tables of a particular database do (in the psql client): \dt. Below is a list of other commands you can give:

\? list all commands

\dn list all schemas (multiple per database)

     List of schemas
        Name        |  Owner
--------------------+----------
 information_schema | monalisa
 pg_catalog         | monalisa
 pg_toast           | monalisa
 public             | monalisa


\l list all databases

     List of databases
   Name    |  Owner   | Encoding
-----------+----------+----------
 mon_data  | mon_user | UNICODE
 template0 | monalisa | UNICODE
 template1 | monalisa | UNICODE


\d abping_aliases describe the abping_aliases table

# select information from abping
select * from abping_aliases;

If you want to look at the schema in detail you can do a schema dump: pg_dump --port=5544 --schema-only --username=mon_user mon_data. If you remove the --schema-only argument you dump the complete database, data included. The following query gives you some idea of which keys the repository is receiving: select * from monitor_ids where mi_key='< some parameter >'. This can supplement the fine-grained log information (although the latter is usually sufficient).
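
If you run such queries regularly, it can help to assemble the psql call in a small script. The Python sketch below only builds and prints the command line, using the connection settings quoted above; the helper name psql_command is hypothetical, and the LIKE pattern is borrowed from the series.0 example earlier in this tutorial purely for illustration.

```python
import shlex


def psql_command(sql, port=5544, dbname="mon_data", username="mon_user"):
    """Build the argument list for a one-off psql query.

    The defaults are the values used in this tutorial; adjust them to
    match your own MonALISA repository configuration.
    """
    return [
        "psql",
        f"--port={port}",
        f"--dbname={dbname}",
        f"--username={username}",
        f"--command={sql}",
    ]


query = ("select * from monitor_ids "
         "where mi_key like 'PhEDEx/SensorStatus/%/SensorStatus';")
cmd = psql_command(query)

# Print a shell-safe version to copy and paste; to actually run it you
# could pass `cmd` to subprocess.run(cmd, check=True) on the repository host.
print(shlex.join(cmd))
```

Keeping the command as a list (rather than one string) avoids quoting problems with the single quotes and the % wildcard in the query.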