Getting the Most Out of Your PHP Log Files: A Practical Guide

tejas

10 years ago

It could rightfully be said that logs are one of the most underestimated and underutilized tools at a freelance php developer’s disposal. Despite the wealth of information they can offer, it is not uncommon for logs to be the last place a developer looks when trying to resolve a problem.
In truth, PHP log files should in many cases be the first place to look for clues when problems occur. Often, the information they contain could significantly reduce the amount of time spent pulling out your hair trying to track down a gnarly bug.
But perhaps even more importantly, with a bit of creativity and forethought, your logs files can be leveraged to serve as a valuable source of usage information and analytics. Creative use of log files can help answer questions such as: What browsers are most commonly being used to visit my site? What’s the average response time from my server? What was the percentage of requests to the site root? How has usage changed since we deployed the latest updates? And much, much more.
PHP log files
This article provides a number of tips on how to configure your log files, as well as how to process the information that they contain, in order to maximize the benefit that they provide.

Although this article focuses technically on logging for PHP developers, much of the information presented herein is fairly “technology agnostic” and is relevant to other languages and technology stacks as well.

Note: This article presumes basic familiarity with the Unix shell. For those lacking this knowledge, an Appendix is provided that introduces some of the commands needed for accessing and reading log files on a Unix system.

Our PHP Log File Example Project

As an example project for discussion purposes in this article, we will take Symfony Standard as a working project and we’ll set it up on Debian 7 Wheezy with rsyslogd, nginx, and PHP-FPM.

composer create-project symfony/framework-standard-edition my "2.6.*"

This quickly gives us a working test project with a nice UI.

Tips for Configuring Your Log Files

Here are some pointers on how to configure your log files to help maximize their value.

Error Log Confguration

Error logs represent the most basic form of logging; i.e., capturing additional information and detail when problems occur. So, in an ideal world, you would want there to be no errors and for your error logs to be empty. But when problems do occur (as they invariably do), your error logs should be one of the first stops you make on your debugging trail.
Error logs are typically quite easy to configure.
For one thing, all error and crash messages can be logged in the error log in exactly the same format in which they would otherwise be presented to a user. With some simple configuration, the end user will never need to see those ugly error traces on your site, while devops will be still able to monitor the system and review these error messages in all their gory detail. Here’s how to setup this kind of logging in PHP:

log_errors = On
error_reporting = E_ALL
error_log = /path/to/my/error/log

Another two lines that are important to include in a log file for a live site, to preclude gory levels of error detail from being to presented to users, are:

display_errors = Off
display_startup_errors = Off

System Log (`syslog`) Confguration

There are many generally compatible implementations of the syslog daemon in the open source world including:

syslogd and sysklogd – most often seen on BSD family systems, CentOS, Mac OS X, and others
syslog-ng – default for modern Gentoo and SuSE builds
rsyslogd – widely used on the Debian and Fedora families of operating systems

(Note: In this article, we’ll be using rsyslogd for our examples.)
The basic syslog configuration is generally adequate for capturing your log messages in a system-wide log file (normally /var/log/syslog; might also be /var/log/messages or /var/log/system.log depending on the distro you’re using).
The system log provides several log facilities, eight of which (LOG_LOCAL0 through LOG_LOCAL7) are reserved for user-deployed projects. Here, for example, is how you might setup LOG_LOCAL0 to write to 4 separate log files, based on logging level (i.e., error, warning, info, debug):

# /etc/rsyslog.d/my.conf
local0.err      /var/log/my/err.log
local0.warning  /var/log/my/warning.log
local0.info     -/var/log/my/info.log
local0.debug    -/var/log/my/debug.log

Now, whenever you write a log message to LOG_LOCAL0 facility, the error messages will go to /var/log/my/err.log, warning messages will go to /var/log/my/warning.log, and so on. Note, though, that the syslog daemon filters messages for each file based on the rule of “this level and higher”. So, in the example above, all error messages will appear in all four configured files, warning messages will appear in all but the error log, info messages will appear in the info and debug logs, and debug messages will only go to debug.log.
One additional important note; The - signs before the info and debug level files in the above configuration file example indicate that writes to those files should be perfomed asynchronously (since these operations are non-blocking). This is typically fine (and even recommended in most situations) for info and debug logs, but it’s best to have writes to the error log (and most prpobably the warning log as well) be synchronous.
In order to shut down a less important level of logging (e.g., on a production server), you may simply redirect related messages to /dev/null (i.e., to nowhere):

local0.debug    /dev/null # -/var/log/my/debug.log

One specific customization that is useful, especially to support some of the PHP log file parsing we’ll be discussing later in this article, is to use tab as the delimiter character in log messages. This can easily be done by adding the following file in /etc/rsyslog.d:

# /etc/rsyslog.d/fixtab.conf
$EscapeControlCharactersOnReceive off

And finally, don’t forget to restart the syslog daemon after you make any configuration changes in order for them to take effect:

service rsyslog restart

Server Log Confguration

Unlike application logs and error logs that you can write to, server logs are exclusively written to by the corresponding server daemons (e.g., web server, database server, etc.) on each request. The only “control” you have over these logs is to the extent that the server allows you to configure its logging functionality. Though there can be a lot to sift through in these files, they are often the only way to get a clear sense of what’s going on “under the hood” with your server.
Let’s deploy our Symfony Standard example application on nginx environment with MySQL storage backend. Here’s the nginx host config we will be using:

server {
    server_name my.log-sandbox;
    root /var/www/my/web;
    location / {
        # try to serve file directly, fallback to app.php
        try_files $uri /app.php$is_args$args;
    }
    # DEV
    # This rule should only be placed on your development environment
    # In production, don't include this and don't deploy app_dev.php or config.php
    location ~ ^/(app_dev|config)\.php(/|$) {
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_split_path_info ^(.+\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param HTTPS off;
    }
    # PROD
    location ~ ^/app\.php(/|$) {
        fastcgi_pass unix:/var/run/php5-fpm.sock;
        fastcgi_split_path_info ^(.+\.php)(/.*)$;
        include fastcgi_params;
        fastcgi_param SCRIPT_FILENAME $document_root$fastcgi_script_name;
        fastcgi_param HTTPS off;
        # Prevents URIs that include the front controller. This will 404:
        # https://domain.tld/app.php/some-path
        # Remove the internal directive to allow URIs like this
        internal;
    }
    error_log /var/log/nginx/my_error.log;
    access_log /var/log/nginx/my_access.log;
}

With regard to the last two directives above: access_log represents the general requests log, while error_log is for errors, and, as with application error logs, it’s worth setting up extra monitoring to be alerted to problems so you can react quickly.
Note: This is an intentionally oversimplified nginx config file that is provided for example purposes only. It pays almost no attention to security and performance and shouldn’t be used as-is in any “real” environment.
This is what we get in /var/log/nginx/my_access.log after typing https://my.log-sandbox/app_dev.php/ in browser and hitting Enter.

192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /app_dev.php/ HTTP/1.1" 200 6715 "-" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/framework/css/body.css HTTP/1.1" 200 6657 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/framework/css/structure.css HTTP/1.1" 200 1191 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/css/demo.css HTTP/1.1" 200 2204 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-quick-tour.gif HTTP/1.1" 200 4770 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-demo.gif HTTP/1.1" 200 4053 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /bundles/acmedemo/images/welcome-configure.gif HTTP/1.1" 200 3530 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:28 +0300] "GET /favicon.ico HTTP/1.1" 200 6518 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"
192.168.56.1 - - [26/Apr/2015:16:13:30 +0300] "GET /app_dev.php/_wdt/e50d73 HTTP/1.1" 200 13265 "http://my.log-sandbox/app_dev.php/" "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_3) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/42.0.2311.90 Safari/537.36"

This shows that, for serving one page, the browser actually performs 9 HTTP calls. 7 of those, however, are requests to static content, which are plain and lightweight. However, they still take network resources and this is what can be optimized by using various sprites and minification techniques.
While those optimisations are to be discussed in another article, what’s relavant here is that we can log requests to static contents separately by using another location directive for them:

location ~ \.(jpg|jpeg|gif|png|ico|css|zip|tgz|gz|rar|bz2|pdf|txt|tar|wav|bmp|rtf|js)$ {
    access_log /var/log/nginx/my_access-static.log;
}

Remember that nginx location performs simple regular expression matching, so you can include as many static contents extensions as you expect to dispatch on your site.
Parsing such logs is no different than parsing application logs.

Other Logs Worth Mentioning

Two other PHP logs worth mentioning are the debug log and data storage log.

The Debug Log

Another convenient thing about nginx logs is the debug log. We can turn it on by replacing the error_logline of the config with the following (requires that the nginx debug module be installed):

error_log /var/log/nginx/my_error.log debug;

The same setting applies for Apache or whatever other webserver you use.
And incidentally, debug logs are not related to error logs, even though they are configured in the error_logdirective.
Although the debug log can indeed be verbose (a single nginx request, for example, generated 127KB of log data!), it can still be very useful. Wading through a log file may be cumbersome and tedious, but it can often quickly provide clues and information that greatly help accelerate the debugging process.
In particular, the debug log can really help with debugging nginx configurations, especially the most complicated parts, like location matching and rewrite chains.
Of course, debug logs should never be enabled in a production environment. The amount of space they use also and the amount of information that they store means a lot of I/O load on your server, which can degrade the whole system’s performance significantly.

Data Storage Logs

Another type of server log (useful for debugging) is data storage logs. In MySQL, you can turn them on by adding these lines:

[mysqld]
general_log = 1
general_log_file = /var/log/mysql/query.log

These logs simply contain a list of queries run by the system while serving database requests in chronological order, which can be helpful for various debugging and tracing needs. However, they should not stay enabled on production systems, since they will generate extra unnecessary I/O load, which affects performance.

Writing to Your Log Files

PHP itself provides functions for opening, writing to, and closing log files (openlog(), syslog(), and closelog(), respectively).
There are also numerous logging libraries for the PHP developer, such as Monolog (popular among Symfonyand Laravel users), as well as various framework-specific implementations, such as the logging capabilities incorporated into CakePHP. Generally, libraries like Monolog not only wrap syslog() calls, but also allow using other backend functionality and tools.
Here’s a simple example of how to write to the log:

<?php
openlog(uniqid(), LOG_ODELAY, LOG_LOCAL0);
syslog(LOG_INFO, 'It works!');

Our call here to openlog:

configures PHP to prepend a unique identifier to each system log message within the script’s lifetime
sets it to delay opening the syslog connection until the first syslog() call has occurred
sets LOG_LOCAL0 as the default logging facility

Here’s what the contents of the log file would look like after running the above code:

# cat /var/log/my/info.log
Mar  2 00:23:29 log-sandbox 54f39161a2e55: It works!

Maximizing the Value of Your PHP Log Files

Now that we’re all good with theory and basics, let’s see how much we can get from logs making as few changes as possible to our sample Symfony Standard project.
First, let’s create the scripts src/log-begin.php (to properly open and configure our logs) and src/log-end.php (to log information about successful completion). Note that, for simplicity, we’ll just write all messages to the info log.

# src/log-begin.php
<?php
define('START_TIME', microtime(true));
openlog(uniqid(), LOG_ODELAY, LOG_LOCAL0);
syslog(LOG_INFO, 'BEGIN');
syslog(LOG_INFO, "URI\t{$_SERVER['REQUEST_URI']}");
$browserHash = substr(md5($_SERVER['HTTP_USER_AGENT']), 0, 7);
syslog(LOG_INFO, "CLIENT\t{$_SERVER['REMOTE_ADDR']}\t{$browserHash}"); <br />
# src/log-end.php
<?php
syslog(LOG_INFO, "DISPATCH TIME\t" . round(microtime(true) - START_TIME, 2));
syslog(LOG_INFO, 'END');

And let’s require these scripts in app.php:

<?php
require_once(dirname(__DIR__) . '/src/log-begin.php');
syslog(LOG_INFO, "MODE\tPROD");
# original app.php contents
require_once(dirname(__DIR__) . '/src/log-end.php');

For the development environment, we want to require these scripts in app_dev.php as well. The code to do so would be the same as above, except we would set the MODE to DEV rather than PROD.
We also want to track what controllers are being invoked, so let’s add one more line in Acme\DemoBundle\EventListener\ControllerListener, right at the beginning of the ControllerListener::onKernelController() method:

syslog(LOG_INFO, "CONTROLLER\t" . get_class($event->getController()[0]));

Note that these changes total a mere 15 extra lines of code, but can collectively yield a wealth of information.

Analyzing the Data in Your Log Files

For starters, let’s see how many HTTP requests are required to serve each page load.
Here’s the info in the logs for one request, based on the way we’ve configured our logging:

Mar  3 12:04:20 log-sandbox 54f58724b1ccc: BEGIN
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: URI    /app_dev.php/
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: CLIENT 192.168.56.1    1b101cd
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: MODE   DEV
Mar  3 12:04:23 log-sandbox 54f58724b1ccc: CONTROLLER Acme\DemoBundle\Controller\WelcomeController
Mar  3 12:04:25 log-sandbox 54f58724b1ccc: DISPATCH TIME  4.51
Mar  3 12:04:25 log-sandbox 54f58724b1ccc: END
Mar  3 12:04:25 log-sandbox 54f5872967dea: BEGIN
Mar  3 12:04:25 log-sandbox 54f5872967dea: URI    /app_dev.php/_wdt/59b8b6
Mar  3 12:04:25 log-sandbox 54f5872967dea: CLIENT 192.168.56.1    1b101cd
Mar  3 12:04:25 log-sandbox 54f5872967dea: MODE   DEV
Mar  3 12:04:28 log-sandbox 54f5872967dea: CONTROLLER Symfony\Bundle\WebProfilerBundle\Controller\ProfilerController
Mar  3 12:04:29 log-sandbox 54f5872967dea: DISPATCH TIME  4.17
Mar  3 12:04:29 log-sandbox 54f5872967dea: END

So now we know that each page load is actually served with two HTTP requests.
Actually there are two points worth mentioning here. First, the two requests per page load is for using Symfony in dev mode (which I have done throughout this article). You can identify dev mode calls by searching for /app-dev.php/ URL chunks. Second, let’s say each page load is served with two subsequent requests to the Symfony app. As we saw earlier in the nginx access logs, there are actually more HTTP calls, some of which are for static content.
OK, now let’s surf a bit on the demo site (to build up the data in the log files) and let’s see what else we can learn from these logs.

How many requests were served in total since the beginning of the logfile?

# grep -c BEGIN info.log
10

Did any of them fail (did the script shut down without reaching the end)?

# grep -c END info.log
10

We see that the number of BEGIN and END records match, so this tells us that all of the calls were successful. (If the PHP script had not completed successfully, it would not have reached execution of the src/log-end.php script.)

What was the percentage of requests to the site root?

# `grep -cE "\s/app_dev.php/$" info.log`
2

This tells us that there were 2 page loads of the site root. Since we previously learned that (a) there are 2 requests to the app per page load and (b) there were a total of 10 HTTP requests, the percentage of requests to the site root was 40% (i.e., 2×2/10).

Which controller class is responsible for serving requests to site root?

# grep -E "\s/$|\s/app_dev.php/$" info.log | head -n1
Mar  3 12:04:20 log-sandbox 54f58724b1ccc: URI  /app_dev.php/
# grep 54f58724b1ccc info.log | grep CONTROLLER
Mar  3 12:04:23 log-sandbox 54f58724b1ccc: CONTROLLER   Acme\DemoBundle\Controller\WelcomeController

Here we used the unique ID of a request to check all log messages related to that single request. We thereby were able to determine that the controller class responsible for serving requests to site root is Acme\DemoBundle\Controller\WelcomeController.

Which clients with IPs of subnet 192.168.0.0/16 have accessed the site?

# grep CLIENT info.log | cut -d":" -f4 | cut -f2 | sort | uniq
192.168.56.1

As expected in this simple test case, only my host computer has accessed the site. This is of course a very simplistic example, but the capability that it demonstrates (of being able to analyse the sources of the traffic to your site) is obviously quite powerful and important.

How much of the traffic to my site has been from FireFox?

Having 1b101cd as the hash of my Firefox User-Agent, I can answer this question as follows:

# grep -c 1b101cd info.log
8
# grep -c CLIENT info.log
10

Answer: 80% (i.e., 8/10)

What is the percentage of requests that yielded a “slow” response?

For purposes of this example, we’ll define “slow” as taking more than 5 seconds to provide a response. Accordingly:

# grep "DISPATCH TIME" info.log | grep -cE "\s[0-9]{2,}\.|\s[5-9]\."
2

Answer: 20% (i.e., 2/10)

Did anyone ever supply GET parameters?

# grep URI info.log | grep \?

No, Symfony standard uses only URL slugs, so this also tells us here that no one has attempted to hack the site.
These are just a handful of relatively rudimentary examples of the ways in which logs files can be creatively leveraged to yield valuable usage information and even basic analytics.

Other Things to Keep in Mind

Keeping Things Secure

Another heads-up is for security. You might think that logging requests is a good idea, in most cases it indeed is. However, it’s important to be extremely careful about removing any potentially sensitive user information before storing it in the log.

Fighting Log File Bloat

Since log files are text files to which you always append information, they are constantly growing. Since this is a well-known issue, there are some fairly standard approaches to controlling log file growth.
The easiest is to rotate the logs. Rotating logs means:

Periodically replacing the log with a new empty file for further writing
Storing the old file for history
Removing files that have “aged” sufficiently to free up disk space
Making sure the application can write to the logs uniterrupted when these file changes occur

The most common solution for this is logrotate, which ships pre-installed with most *nix distributions. Let’s see a simple configuration file for rotating our logs:

/var/log/my/debug.log
/var/log/my/info.log
/var/log/my/warning.log
/var/log/my/error.log
{
    rotate 7
    daily
    missingok
    notifempty
    delaycompress
    compress
    sharedscripts
    postrotate
        invoke-rc.d rsyslog rotate > /dev/null
    endscript
}

Another, more advanced approach is to make rsyslogd itself write messages into files, dynamically created based on current date and time. This would still require a custom solution for removal of older files, but lets devops manage timeframes for each log file precisely. For our example:

$template DynaLocal0Err,        "/var/log/my/error-%$NOW%-%$HOUR%.log"
$template DynaLocal0Info,       "/var/log/my/info-%$NOW%-%$HOUR%.log"
$template DynaLocal0Warning,    "/var/log/my/warning-%$NOW%-%$HOUR%.log"
$template DynaLocal0Debug,      "/var/log/my/debug-%$NOW%-%$HOUR%.log"
local1.err      -?DynaLocal0Err
local1.info     -?DynaLocal0Info
local1.warning  -?DynaLocal0Warning
local1.debug    -?DynaLocal0Debug

This way, rsyslog will create an individual log file each hour, and there won’t be any need for rotating them and restarting the daemon. Here’s how log files older than 5 days can be removed to accomplish this solution:

find /var/log/my/ -mtime +5 -print0 | xargs -0 rm

Remote Logs

As the project grows, parsing information from logs gets more and more resource hungry. This not only means creating extra server load; it also means creating peak load on the CPU and disk drives at the times when you parse logs, which can degrade server response time for users (or in a worst case can even bring the site down).
To solve this, consider setting up a centralized logging server. All you need for this is another box with UDP port 514 (default) open. To make rsyslogd listen to connections, add the following line to its config file:

$UDPServerRun 514

Having this, setting up the client is then as easy as:

*.*   @HOSTNAME:514

(where HOSTNAME is the host name of your remote logging server).

Conclusion

While this article has demonstrated some of the creative ways in which log files can offer way more valuable information than you may have previously imagined, it’s important to emphasize that we’ve only scratched the surface of what’s possible. The extent, scope, and format of what you can log is almost limitless. This means that – if there’s usage or analytics data you want to extract from your logs – you simply need to log it in a way that will make it easy to subsequently parse and analyze. Moreover, that analysis can often be performed with standard Linux command line tools like grep, sed, or awk.
Indeed, PHP log files are a most powerful tool that can be of tremendous benefit.

Resources

Code on GitHub: https://github.com/isanosyan/toptal-blog-logs-post-example

Appendix: Reading and Manipulating Log Files in the Unix Shell

Here is a brief intro to some of the more common *nix command line tools that you’ll want to be familiar with for reading and manipulating your log files.

cat is perhaps the most simple one. It prints the whole file to the output stream. For example, the following command will print logfile1 to the console:
```
cat logfile1
```
> character allows user to redirect output, for example into another file. Opens target stream in write mode (which means wiping target contents). Here’s how we replace contents of tmpfile with contents of logfile1:
```
cat logfile1 > tmpfile
```
>> redirects output and opens target stream in append mode. Current contents of target file will be preserved, new lines will be added to the bottom. This will append logfile1 contents to tmpfile:
```
cat logfile1 >> tmpfile
```
grep filters file by some pattern and prints only matching lines. Command below will only print lines of logfile1 containing Bingo message:
```
grep Bingo logfile1
```
cut prints contents of a single column (by number starting from 1). By default searches for tab characters as delimiters between column. For example, if you have file full of timestamps in format YYYY-MM-DD HH:MM:SS, this will allow you to print only years:
```
cut -d"-" -f1 logfile1
```
head displays only the first lines of a file
tail displays only the last lines of a file
sort sorts lines in the output
uniq filters out duplicate lines
wc counts words (or lines when used with the -l flag)
| (i.e., the “pipe” symbol) supplies output from one command as input to the next. Pipe is very convenient for combining commands. For example, here’s how we can find months of 2014 that occur within a set of timestamps:
```
grep -E "^2014" logfile1 | cut -d"-" -f2 | sort | uniq
```

Here we first match lines against regular expression “starts with 2014”, then cut months. Finally, we use combination of sort and uniq to print occurrences only once.