What are the challenges of building a good database?

By Clément Val, CEESAR

1. A quality data acquisition system (DAS)

First and foremost, there is no ‘good’ database without ‘good’ data! This is obvious, but never easy. In most Naturalistic Driving Studies (NDS) and Field Operational Tests (FOT), we are still pushing the envelope of what is technically possible. Sometimes, the sheer amount and complexity of the data make performance an issue: we are currently working on a project that uses raw video and LIDAR sensors, which together produce hundreds of megabytes every second! Most of the time, however, the difficulty comes from performing apparently trivial data collection, such as logging controller area network (CAN) data and compressed video, but doing it continuously and automatically, on hundreds of vehicles and for very little money.

I have yet to be part of a project where we could simply buy an off-the-shelf system and deploy it on a fleet, either because no suitable products were commercially available or because those that existed were too expensive. As a result, we generally end up designing systems based on several heterogeneous components and software packages. However, building something that just works, especially in a vehicle environment, is anything but stress-free.

Nowadays, most people own smartphones or other technological wonders, and assume that if those devices can do what they do, collecting data in a car must be really easy with today’s technology. Unfortunately, few people realise that thousands of developers may have worked on a single, apparently simple consumer product... This often leads to underestimating the work and resources required to build a decent DAS.

The first challenge, therefore, is setting up a DAS that is reliable and performs consistently over long periods of time. Experience has taught me that ambitions need to be aligned with funding; this may be one of the most important jobs of an NDS/FOT project manager. Researchers will always want the biggest, richest dataset, but if ambitions are set too high, they will only get unreliable data, after huge delays, which will hinder high-quality analysis. My second recommendation is not to overlook acquisition system validation: know precisely what you want, specify it, and define a proper validation plan. Even then, of course, you can expect surprises along the road. So do not overlook pilots either!

 

2. A clean and fit-for-purpose database

The second challenge is to deal properly with the inevitable hiccups of the logging system, and to prepare the data correctly before uploading it. The aim is simple: do not upload junk into the database.

Should you upload trips where most sensors are missing? Should you pad missing data with null values? If some vehicles report three states for a function and others report five states for the same function, should you try to harmonise them or keep them separate? Should you resample this signal? With which algorithm? There is never a simple and definitive answer. All these difficult choices must be made consciously, knowing what impact they may have on the final analysis. What is also clear is that data quality needs to be assessed before the data is used... At this stage, I have a very simple recommendation: analysts must work hand in hand with those who have a deep technical understanding of vehicles, sensors and data collection tools.
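To make one of these choices concrete, here is a minimal sketch of resampling an irregularly-timed signal to a fixed grid while padding long gaps with nulls rather than interpolating across them. The function name, rates and thresholds are all invented for illustration; this is not any real project’s pipeline.

```python
def resample_signal(samples, period=0.1, max_gap=0.5):
    """Resample (timestamp, value) pairs, sorted by timestamp, onto a
    fixed grid. Gaps longer than max_gap seconds yield None instead of
    an interpolated value, so the analysis can see where data was missing."""
    if not samples:
        return []
    start, end = samples[0][0], samples[-1][0]
    out, i, k = [], 0, 0
    while True:
        t = start + k * period
        if t > end + 1e-9:
            break
        # advance to the source interval containing t
        while i + 1 < len(samples) and samples[i + 1][0] <= t:
            i += 1
        t0, v0 = samples[i]
        t1, v1 = samples[i + 1] if i + 1 < len(samples) else (t0, v0)
        if t0 < t < t1 and t1 - t0 > max_gap:
            out.append((round(t, 3), None))   # long gap: pad with null
        elif t1 == t0:
            out.append((round(t, 3), v0))     # at or past the last sample
        else:
            frac = (t - t0) / (t1 - t0)       # linear interpolation
            out.append((round(t, 3), v0 + frac * (v1 - v0)))
        k += 1
    return out

# A 0.8 s hole between 0.2 s and 1.0 s is padded, not interpolated:
print(resample_signal([(0.0, 0.0), (0.2, 2.0), (1.0, 10.0)]))
```

The key design choice, whichever algorithm you pick, is that the padding makes the hole explicit in the database instead of silently inventing plausible values.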

 

3. Work on the substance… and the form!

The third challenge is to organise the database in a way that makes it usable, in terms of both legibility and performance. The first thing to do is not technical, but it is nonetheless very important: the database must be properly documented. It should be easy to find what you are looking for and, conversely, when you are looking at a column of data, you should always be able to tell its origin (i.e. how it was acquired, pre-processed or calculated, and in which unit it is expressed).

A dedicated chapter on metadata can be found in the Data Sharing Framework published by FOT-Net Data. That is really, really important. Defining and using naming conventions is a must as well: always make sure that the exact same data does not live in different places under different names. It seems obvious, but on a large collaborative project this requires some coordination. Someone should always be designated as accountable for the consistency of the database.

Good practices in database management should always be strictly followed. In a nutshell:

  • There should be a clear structure.
  • Redundancy should be avoided.
  • Strong relations should be favoured. Even a relational database system can be used in the worst way possible: without proper relations, and with ‘flexible’ datatypes (blob, text…), which remove the ‘hassle’ of structuring the database, but also its benefits.
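As an illustration of the ‘strong relations’ point, here is a toy SQLite sketch in which the schema itself rejects inconsistent rows. The table and column names are invented for the example, not taken from any actual NDS database:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite needs this turned on explicitly
conn.executescript("""
    CREATE TABLE vehicle (
        vehicle_id INTEGER PRIMARY KEY,
        model      TEXT NOT NULL
    );
    CREATE TABLE trip (
        trip_id    INTEGER PRIMARY KEY,
        vehicle_id INTEGER NOT NULL REFERENCES vehicle(vehicle_id),
        start_utc  TEXT NOT NULL          -- ISO 8601, documented unit
    );
""")
conn.execute("INSERT INTO vehicle VALUES (1, 'demo car')")
conn.execute("INSERT INTO trip VALUES (1, 1, '2014-01-01T08:00:00Z')")

# The schema itself now refuses a trip pointing at a non-existent vehicle:
try:
    conn.execute("INSERT INTO trip VALUES (2, 999, '2014-01-01T09:00:00Z')")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True
print(rejected)  # True
```

With a blob or free-text column in place of the foreign key, that orphan row would have been accepted silently, and the inconsistency discovered much later, during analysis.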

Again, this is easier said than done. When we collect hundreds of years of driving data, we are genuinely talking about big data. The problem is that if the database is ‘too’ relational, ‘too’ structured, it will not scale well, and some queries will simply be too slow or even impossible to perform. So a trade-off has to be found between structure and performance. It means using relational systems in a slightly different way, and/or combining them with NoSQL systems, file storage…
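One common form of that trade-off can be sketched as follows: keep the metadata relational, but store the bulky time series as plain files that the database merely references by path. This is a toy illustration with invented names and formats, not a description of any particular project’s architecture:

```python
import json
import os
import sqlite3
import tempfile

data_dir = tempfile.mkdtemp()              # stand-in for a real file store
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE trip_file (
    trip_id INTEGER,
    signal  TEXT,
    path    TEXT,
    PRIMARY KEY (trip_id, signal))""")

def store_series(trip_id, signal, series):
    # bulk samples go to a file; only the reference goes into the database
    path = os.path.join(data_dir, f"{trip_id}_{signal}.json")
    with open(path, "w") as f:
        json.dump(series, f)
    conn.execute("INSERT INTO trip_file VALUES (?, ?, ?)",
                 (trip_id, signal, path))

def load_series(trip_id, signal):
    (path,) = conn.execute(
        "SELECT path FROM trip_file WHERE trip_id = ? AND signal = ?",
        (trip_id, signal)).fetchone()
    with open(path) as f:
        return json.load(f)

store_series(1, "speed", [0.0, 1.2, 2.5])
print(load_series(1, "speed"))  # [0.0, 1.2, 2.5]
```

Queries over metadata stay fast and structured, while the heavy data never bloats the relational tables; the cost, as noted above, is that the database alone can no longer guarantee consistency between the two.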

In such cases, you will need something else to enforce consistency. This ‘something else’ will almost always be a combination of software and procedures. Whatever you cannot enforce directly through the database schema, enforce as much as possible by writing dedicated code that does it for you automatically and repeatably. Here at CEESAR, we develop a tool called SALSA whose aim is exactly that: automating most of the processing and database management, depending on the intent of the researchers. It will be used in UDRIVE. Finally, for what needs to be done manually, rely on qualified people following good practices and previously defined procedures. Database administrator is a real job title, and there is a reason for that…

 

About Clément Val

Clément Val is a research engineer with close to 15 years of experience in the automotive industry. He has spent the last decade at CEESAR working on projects dealing with driver behaviour and its relation to the vehicle’s environment. The topics covered by his work include traffic safety, driving assistance systems and autonomous driving.

CEESAR is a non-profit French organisation focused on road safety. Its role in UDRIVE includes the specification and oversight of the DAS development, its adaptation to the passenger cars used in the study, data collection in France, and the development of data pre-processing and analysis tools. CEESAR also managed data collection and processing for the French site in the EUROFOT project.

For more information, contact Clément Val.