Storage is stuck at "Device is BLOCKED waiting to create a volume"


Storage is stuck at "Device is BLOCKED waiting to create a volume"

Zdeněk Bělehrádek
Hi,

We are using Bacula to back up our company's data. All storages are
ordinary Debian Jessie Linux servers with spinning disks; we don't use
tapes. The Bacula versions are 7.0.5+dfsg-4~bpo80+1 and
7.4.3+dfsg-1+sid1~bpo8+1 (we tried both).

We need 2 copies of each backup placed in separate datacenters, so we
run periodic Copy jobs to mirror data between storages. We want to use
odd-numbered storages to make a backup, and then copy it to an
even-numbered storage.

Our current configuration suffers from occasional deadlocks when Bacula
tries to read from and write to a single storage. I thought they were
probably caused by a mistake in the config, where storages have the same
Media Type (as documented at
http://www.bacula.org/7.4.x-manuals/en/main/Migration_Copy.html#SECTION002830000000000000000
).

For this reason we decided to create a new config where every storage has
a Media Type different from every other. When I tested this new config in
a testing environment, jobs got stuck and never finished.
status storage=bacst2-stor showed:

    Device is BLOCKED waiting to create a volume for:
       Pool:        zdenek-test-pp_old-full-pool-mirror
       Media type:  File-storspec-mirror
    Available Space=5.323 GB

and never makes progress - the device is unusable for all jobs (they
simply wait). I tried to mount and label a new volume, but it didn't make
any difference. The only thing that helps is to restart the storage daemon,
which makes the stuck job fail.
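
To spot this state without reading the whole status output by hand, a small filter along these lines might help. It is only a sketch: the function name blocked_pools is made up for illustration, and it assumes the status text looks like the excerpt above (a "Device is BLOCKED waiting to create a volume" line followed by a "Pool:" line).

```shell
#!/bin/sh
# Hypothetical helper: read `status storage=...` output on stdin and print
# the pool name of every device reported as BLOCKED waiting for a volume.
blocked_pools() {
    awk '
        # Remember that we are inside a BLOCKED report.
        /Device is BLOCKED waiting to create a volume/ { blocked = 1; next }
        # The next "Pool:" line names the pool the device is waiting for.
        blocked && /Pool:/ { print $2; blocked = 0 }
    '
}
```

It could be fed from the console the same way as the job commands further down, for example: echo "status storage=bacst2-stor" | bacula-console | blocked_pools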

Strace of the storage daemon on bacst2 revealed that the director connects
to it, both authenticate to each other, and the storage sends
"\0\0\0\0223000 OK Hello 305\n" to the director. The storage then reads from
the socket and never gets any reply - the thread just blocks in the read()
syscall indefinitely.

Strace of the director confirms this - a thread connects to the storage,
authenticates, reads the Hello and then never replies. Instead it opens
communication with bacst1 and starts sending commands. Even after
several minutes (test backups are several KB in size and usually
finish in a few seconds) the network socket to bacst2 is still open and
no communication is taking place.

I verified this with tcpdump and there's nothing suspicious - the
connection works normally, and the last packet sent is the Hello message
described above. Communication on that four-tuple then simply stops:
nobody sends anything, and the connection is never closed.
There is no firewall or NAT between the servers - they are connected to
a single internal network.

I also tried to upgrade our 7.0 install to the latest 7.4 from Debian;
the results are exactly the same.

Configuration and strace output are at:
https://drive.google.com/file/d/0B4bjslETcBa-ZHVkOHU4dlZCZ2s/view?usp=sharing

I can reliably replicate the issue by running (on director):

    for i in $(seq 1 2); do
        for job in bacst1_storage-job --bacst1_storage-incremental-job-mirror \
                   --bacst1_storage-full-job-mirror bacdir1_director-job \
                   --bacdir1_director-incremental-job-mirror \
                   --bacdir1_director-full-job-mirror; do
            echo "run job=$job yes" | bacula-console
        done
    done

Is this a known problem? Is there any workaround?

------------------------------------------------------------------------------
Check out the vibrant tech community on one of the world's most
engaging tech sites, Slashdot.org! http://sdm.link/slashdot
_______________________________________________
Bacula-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/bacula-users

Re: Storage is stuck at "Device is BLOCKED waiting to create a volume"

Zdeněk Bělehrádek
User optiz0r on IRC helped me get trace files for all daemons; they are at
http://filebin.ca/3HoXMMcEo2rv/traces.tar.gz

The configuration used may be slightly different (the only difference I can
think of is setting Attribute Spooling = yes).

We noticed the following errors:
bacst1-sd.trace:bacst1-sd: device.c:232-1 getvolinfo failed. No new Vol:
Error getting Volume info: 1998 Volume "bacst1_storage-full-vol-0001"
catalog status is Used, but should be Append, Purged or Recycle.

In this run, the error is reported for every volume except
bacdir1_director-full-vol-0005, which is also the only volume whose
status is not Used (it is Append). Maybe that is significant?
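
To check which volumes trigger this error across a whole trace file, something like the following sketch could be used. The function name volume_status_errors is hypothetical, and it assumes that each error sits on a single line in the real trace file (the line breaks in the excerpt above come from mail wrapping) and that grep supports -o, as GNU grep does.

```shell
#!/bin/sh
# Hypothetical helper: read trace text on stdin and print one
# "<volume> <status>" line per distinct volume named in a
# "1998 Volume ... catalog status is ..." error.
volume_status_errors() {
    grep -o 'Volume "[^"]*" catalog status is [A-Za-z]*' |
    sed 's/^Volume "\([^"]*\)" catalog status is \(.*\)$/\1 \2/' |
    sort -u
}
```

For example: cat bacst1-sd.trace | volume_status_errors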


On 29.3.2017 at 16:42, Zdeněk Bělehrádek wrote:


Re: Storage is stuck at "Device is BLOCKED waiting to create a volume"

Kern Sibbald
Hello,

The error you are getting should never happen, which means that
something is seriously wrong with your Bacula installation.  A few of
the multiple possibilities are:

1. Your DIR and SDs are not on the same version.  They *must* all be the
same. With the little information you provided, for the moment this
seems to be the most likely problem.

2. Your catalog is damaged.

3. Your Retention periods are too short and records are being removed
from the catalog.

4. You have manually modified your catalog, so that now the records are
not consistent.

5. Your catalog does not correspond to the Bacula Director version you
are running.  This should be detected, but perhaps the catalog was later
manually modified.

6. Either manually or some program is removing Volume records from the
catalog or changing them (this point is probably a duplication of point 4)

Best regards,

Kern



On 04/03/2017 06:31 PM, Zdeněk Bělehrádek wrote:


Re: Storage is stuck at "Device is BLOCKED waiting to create a volume"

Zdeněk Bělehrádek
Hi, thanks for your reply.

Ad 1: they are the same, specifically 7.4.3+dfsg-1+sid1~bpo8+1 from
jessie-backports (I just verified it). For this test, even the FDs were
this version.

Ad 2: I worked with a clean catalog:
 - stop director and storages
 - psql: drop database bacula
 - psql: create database bacula owner bacula
 - PGPASSWORD=XXXXX db_name=bacula /usr/share/bacula-director/make_postgresql_tables -U bacula -h bacdb1.cent -d bacula
 - start director and storages, enable trace

To be sure, I checked the PostgreSQL logs, and there is only one error,
repeating every time Bacula runs a job:
Apr  3 20:00:01 bacdb1 postgres[10867]: [24-1] 2017-04-03 20:00:01 CEST [10867-43] bacula@bacula ERROR:  table "delcandidates" does not exist
Apr  3 20:00:01 bacdb1 postgres[10867]: [24-2] 2017-04-03 20:00:01 CEST [10867-44] bacula@bacula STATEMENT:  DROP TABLE DelCandidates

I don't know why Bacula tries to drop a nonexistent table, but looking
at the source code, this query is used only when pruning jobs, to clean
up temporary tables. I think it is harmless.

I ran dbcheck against my catalog, and it found 2 orphaned clients (one
is not accessible in the testing environment and not needed, one has its
job stuck) and 2 orphaned filesets (both have jobs that didn't run yet). So
no errors there either.

The server is an OpenStack virtual server running on our infrastructure;
there were no crashes or other problems that I know of.

Is there any other way to check for catalog damage?

Ad 3: I ran the jobs manually after setting up the new catalog; it takes
only a few minutes. My retention periods are 7 days minimum.

Ad 4: I do not edit the catalog manually. I was using bacula-web to
display the contents of the catalog, so to be sure I re-ran the test
with a clean catalog and bacula-web disabled, and the bug is still there.

Ad 5: I created it fresh by running make_postgresql_tables (from the Bacula
package) in an empty database.

root@bacdir1:~# dpkg -S /usr/share/bacula-director/make_postgresql_tables
bacula-director-pgsql: /usr/share/bacula-director/make_postgresql_tables
root@bacdir1:~# dpkg -l bacula-director-pgsql | grep "^ii"
ii  bacula-director-pgsql  7.4.3+dfsg-1+sid1~bpo8+1  amd64  network backup service - PostgreSQL storage for Director
[PREP]root@bacdir1:~# grep /usr/share/bacula-director/make_postgresql_tables -e Version
INSERT INTO Version (VersionId) VALUES (15);
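
The same check can be scripted so that the version the package would install is compared with what the live catalog reports. This is a rough sketch: script_catalog_version is a made-up name, and it relies only on the INSERT statement shown in the grep output above.

```shell
#!/bin/sh
# Hypothetical helper: print the catalog schema version that the given
# table-creation script would install, based on its
# "INSERT INTO Version (VersionId) VALUES (N);" statement.
script_catalog_version() {
    sed -n 's/.*INSERT INTO Version (VersionId) VALUES (\([0-9][0-9]*\));.*/\1/p' "$1"
}
```

The result can then be compared with the value stored in the database, for instance with psql -U bacula -h bacdb1.cent -d bacula -At -c 'SELECT VersionId FROM Version;' (connection parameters as used for the catalog creation).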

Ad 6: there are 3 programs that could do it automatically: the Bacula
director, bacula-web (I disabled it) and a Nagios check (we don't run
Nagios in the test environment). I am quite sure nobody except Bacula can do
it. And yes, I am sure none of my co-workers could have messed with the
catalog either; I did ask.


Looking at the above, I am starting to think it may be a bug in Bacula.
Should I report it? Where?

With regards,
Zdeněk Bělehrádek

On 3.4.2017 at 18:51, Kern Sibbald wrote:


Re: Storage is stuck at "Device is BLOCKED waiting to create a volume"

Kern Sibbald
Hello,

Well, I am out of ideas.

Yes, Bacula has a bugs database, and you can report it, but at this
point it appears unlikely that it is a bug otherwise someone else would
have the same problem.  I will need to have a way to reproduce the
problem. You can try turning on level 200 debug in the Director, and
when the problem arises, do an llist on all volumes (note that is llist
with double l).  Also provide your bacula-dir.conf and bacula-sd.conf.
That may show some problem. The main point is for you to prove that
there are other suitable Volumes that are available.  If doing those
things does not uncover a problem, and I cannot reproduce it (currently
the case), there will not be much more that I can do.
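
As a rough illustration of the two console steps mentioned above, the commands could be generated and piped into the console like this. The function name debug_commands is hypothetical, and trace=1 (which sends the debug output to a trace file) is an addition that the text above does not ask for.

```shell
#!/bin/sh
# Hypothetical helper: print the console commands for the suggested
# diagnostics, one per line.
debug_commands() {
    # Level 200 debug in the Director; trace=1 is an assumption that
    # writes the debug output to the Director's trace file.
    printf '%s\n' 'setdebug level=200 trace=1 dir'
    # Long listing of all volumes (llist, with a double l).
    printf '%s\n' 'llist volumes'
}
```

For example: debug_commands | bacula-console (using the console command name from the original report).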

Best regards,

Kern


On 04/04/2017 01:48 PM, Zdeněk Bělehrádek wrote:


Re: Storage is stuck at "Device is BLOCKED waiting to create a volume"

Zdeněk Bělehrádek
Hi,

1. It is 5 concurrently started copies of:
 a) 2 backup jobs
 b) 2 Copy jobs that copy Full backups from a)
 c) 2 Copy jobs that copy Incremental backups from a)
I can sometimes replicate the problem with just 2 copies of the above,
but 5 copies have been about 90 % reliable. The problem doesn't occur every
time, only when there are a lot of jobs at once.

2. These commands instruct your Docker daemon to download, but not run,
my containers from Docker Hub:

docker pull lyco/debug-bacdb:2017-04-07
docker pull lyco/debug-bacdir:2017-04-07
docker pull lyco/debug-bacst1:2017-04-07
docker pull lyco/debug-bacst2:2017-04-07

You can skip this and just run them as described in point 4, because the
Docker daemon will download them for you.

Of course you have to install the docker package first; a detailed guide is
e.g. at
https://docs.docker.com/engine/installation/linux/ubuntu/#install-using-the-repository

3. 150 to 250 MB each. They should require only a few megabytes of RAM,
except the bacdb container: PostgreSQL in it is configured with 256 MB
of shared_buffers. I am pretty sure they will be runnable on any
reasonable developer machine.

4.
docker run -d --network bactest --network-alias bacdb1.cent lyco/debug-bacdb:2017-04-07
docker run -d --network bactest --network-alias bacdir1.cent lyco/debug-bacdir:2017-04-07
docker run -d --network bactest --network-alias bacst1.cent lyco/debug-bacst1:2017-04-07
docker run -d --network bactest --network-alias bacst2.cent lyco/debug-bacst2:2017-04-07
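
Note that docker run --network expects the named network to exist already, so the startup can be wrapped in a small script that creates it first. The following is only a sketch: the function names and the DRY_RUN switch are made up for illustration, and the image names follow the image list quoted later in this message (lyco/debug-bacst2 for the second storage).

```shell
#!/bin/sh
# Sketch: start the four test containers on one user-defined network.
# Set DRY_RUN=1 to print the docker commands instead of executing them.
TAG=2017-04-07
NETWORK=bactest

run() {
    if [ "${DRY_RUN:-0}" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

start_containers() {
    # Create the network first; this step reports an error (and the
    # function carries on) if the network already exists.
    run docker network create "$NETWORK"
    # Each entry is "<network alias> <image name>".
    for pair in "bacdb1.cent debug-bacdb" "bacdir1.cent debug-bacdir" \
                "bacst1.cent debug-bacst1" "bacst2.cent debug-bacst2"; do
        set -- $pair
        run docker run -d --network "$NETWORK" --network-alias "$1" "lyco/$2:$TAG"
    done
}
```

Calling DRY_RUN=1 start_containers first shows what would be run; start_containers on its own then starts everything.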

Now you have the containers running in the background. If you want to run
any command in a container, you have to know the ID of the running container:

docker ps

Then you can run an interactive shell in the container:

docker exec -it <container_id> bash

You might want to install strace, gdb etc. in it.
Changes made inside a container are not saved to the image; they are
discarded when the container is removed.

Explanation of flags:
-d: detach (run in background)
--network: connect containers to this virtual network (name is arbitrary)
--network-alias: give container DNS name in the virtual network
-it: interactive, allocate tty

5. No, this is a misunderstanding. While Google does have some services
that can run containers, you don't need them (and you would have to pay
for them).

Basically, a container is a "super chroot" - you run a process in an
isolated environment using your own kernel (i.e. it's not virtualized), but
possibly with different libraries, config files, network setup or mounted
disks. Because of this, you gain reproducibility - you can run the same
binary in the same environment, no matter what the underlying system is,
as long as it is a reasonably new Linux. You only need the container
images - files with the filesystem and metadata needed to run them. This is
what I uploaded to Docker Hub, and what is named
lyco/debug-bacst2:2017-04-07 etc.

What I uploaded to Google Drive is a set of scripts and files that you
need if you want to recreate my images. It is there for you to see what I
did to make them and what exact software etc. I used, and to make it easier
to test any changes.

P. S.: I totally meant to post this to bacula-users and original poster
too. Sorry, reposting to list.

On 7.4.2017 at 20:41, Kern Sibbald wrote:

> Hello,
>
> Well, it sounds like you have been working hard.
>
> I am doing my development on a Ubuntu 16.04 machine, so I imagine it can
> handle docker containers as well as a lot of other stuff. However, I
> have never used a container, and I am assuming that you want me to do
> so.  I am willing to try, but here are a few questions:
>
> 1. It seems like you need four images and there are apparently 5 Bacula
> jobs I need to start. Is that correct?
>
> 2. What is the command I would use to get the images downloaded to my
> machine?
>
> 3. Approximately how big is their total size?
>
> 4. Once I have them here, what is the command(s) I use to start them?
>
> 5. You seem to say that I can run them on a google drive.  How do I do
> that?
>
> I am a bit concerned.  This seems to be a very big setup -- that is not
> something simple.  I'll take a look at it, but if it is overly complex,
> please don't count on me.  I don't have much spare time, and I don't do
> support work (takes much too long), but if I can clearly see a bug,
> there is a good chance that I can fix it.
>
> The typical test situation that I deal with is anything similar to the
> test files in <bacula-source>/regress/tests.  Your setup for the moment
> seems to be more complicated (hopefully I am mistaken).
>
> Best regards,
>
> Kern
>
>
>
> On 04/07/2017 05:00 PM, Zdeněk Bělehrádek wrote:
>> Hi,
>>
>> I managed to replicate the problem in a set of Docker containers  based
>> upon the semiofficial Debian Jessie container and jessie-backports
>> packages.
>>
>> Images are:
>> REPOSITORY                 TAG                 IMAGE ID
>> lyco/debug-bacdb           2017-04-07          345265e86294
>> lyco/debug-bacst2          2017-04-07          6e355bf8c0ba
>> lyco/debug-bacst1          2017-04-07          f9ff4567bd26
>> lyco/debug-bacdir          2017-04-07          e1565bff29ec
>>
>> Start them with the command
>> ./run 2017-04-07
>> from the tarball below (or see the note), exec bash in bacdir and run
>>
>> for i in `seq 1 5` ; do \
>> for job in bacst1_storage-job \
>> --bacst1_storage-incremental-job-mirror \
>> --bacst1_storage-full-job-mirror bacdir1_director-job \
>> --bacdir1_director-incremental-job-mirror \
>> --bacdir1_director-full-job-mirror ; do \
>> echo "run job=$job yes" | bacula-console ; done ; done
>>
>> This is meant to simulate the situation when a long-running backup delays
>> other jobs until Copy jobs start running too. This kind of situation
>> happens in our production too and is the source of the problems that
>> forced me to write the new config I am trying to debug.
>>
>> If you want to recreate the containers yourselves (e. g. to check there
>> isn't any problem with my packages etc.), you can download the scripts
>> and configs that I am using to create these containers as a tarball:
>>
>> https://drive.google.com/file/d/0B4bjslETcBa-c0M4N3hueDg2OEE/view?usp=sharing
>>
>>
>> The configuration is copied from the testing environment, looks like the
>> config that I would like to use in production, and has been changed only
>> minimally (enabled logging to files, enabled access to the DB not based on
>> hostnames). The containers themselves aren't exactly a best-practices
>> showcase (things like using a shell instead of init), but it shouldn't
>> matter for Bacula.
>>
>> Don't worry about passwords, I already changed them in my setup.
>>
>> Note: the run command just runs the images in common network with DNS
>> names bacdir1.cent, bacdb1.cent, bacst1.cent and bacst2.cent.
>>
>> On 4.4.2017 at 16:09, Kern Sibbald wrote:
>>> Hello,
>>>
>>> Well, I am out of ideas.
>>>
>>> Yes, Bacula has a bugs database, and you can report it, but at this
>>> point it appears unlikely that it is a bug otherwise someone else would
>>> have the same problem.  I will need to have a way to reproduce the
>>> problem. You can try turning on level 200 debug in the Director, and
>>> when the problem arises, do an llist on all volumes (note that is llist
>>> with double l).  Also provide your bacula-dir.conf and bacula-sd.conf.
>>> That may show some problem. The main point is for you to prove that
>>> there are other suitable Volumes that are available.  If doing those
>>> things does not uncover a problem, and I cannot reproduce it (currently
>>> the case), there will not be much more that I can do.
>>>
>>> Best regards,
>>>
>>> Kern
>>>
>>>
>>> On 04/04/2017 01:48 PM, Zdeněk Bělehrádek wrote:
>>>> Hi, thanks for your reply.
>>>>
>>>> Ad 1: they are the same, specifically 7.4.3+dfsg-1+sid1~bpo8+1 from
>>>> jessie-backports (I just verified it). For this test, even the FDs were
>>>> this version.
>>>>
>>>> Ad 2: I worked with clean catalog:
>>>>    - stop director and storages
>>>>    - psql: drop database bacula
>>>>    - psql: create database bacula owner bacula
>>>>    - PGPASSWORD=XXXXX db_name=bacula
>>>> /usr/share/bacula-director/make_postgresql_tables -U bacula -h
>>>> bacdb1.cent -d bacula
>>>>    - start director and storages, enable trace
>>>>
>>>> To be sure, I checked the PostgreSQL logs, and there is only one error,
>>>> repeating every time Bacula runs a job:
>>>> Apr  3 20:00:01 bacdb1 postgres[10867]: [24-1] 2017-04-03 20:00:01 CEST [10867-43] bacula@bacula ERROR:  table "delcandidates" does not exist
>>>> Apr  3 20:00:01 bacdb1 postgres[10867]: [24-2] 2017-04-03 20:00:01 CEST [10867-44] bacula@bacula STATEMENT:  DROP TABLE DelCandidates
>>>>
>>>> I don't know why Bacula tries to delete nonexistent tables, but looking
>>>> at the source code, this query is used only when pruning jobs to clean
>>>> up temporary tables. I think it is harmless.
>>>>
>>>> I ran dbcheck against my catalog, and it found 2 orphaned clients (one
>>>> is not accessible in the testing env and not needed, one has its job
>>>> stuck) and 2 orphaned filesets (both have jobs that didn't run yet). So
>>>> no errors there either.
>>>>
>>>> The server is an OpenStack virtual server running on our infrastructure;
>>>> there were no crashes or other problems that I know of.
>>>>
>>>> Is there any other way to check for catalog damage?
>>>>
>>>> Ad 3: I run the jobs manually after setting up the new catalog; it
>>>> takes only a few minutes. My retention periods are 7 days minimum.
>>>>
>>>> Ad 4: I do not edit the catalog manually. I was using bacula-web to
>>>> display the contents of the catalog, so to be sure I just re-ran the test
>>>> with a clean catalog and bacula-web disabled, and the bug is still here.
>>>>
>>>> Ad 5: I created it fresh by running make_postgresql_tables (from the
>>>> bacula package) in an empty database.
>>>>
>>>> root@bacdir1:~# dpkg -S /usr/share/bacula-director/make_postgresql_tables
>>>> bacula-director-pgsql: /usr/share/bacula-director/make_postgresql_tables
>>>> root@bacdir1:~# dpkg -l bacula-director-pgsql | grep "^ii"
>>>> ii  bacula-director-pgsql                     7.4.3+dfsg-1+sid1~bpo8+1 amd64                     network backup service - PostgreSQL storage for Director
>>>> [PREP]root@bacdir1:~# grep /usr/share/bacula-director/make_postgresql_tables -e Version
>>>> INSERT INTO Version (VersionId) VALUES (15);
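
The check above only shows which schema version the packaged script would create. A small sketch of pulling that number out so it can be compared with what the live catalog reports follows; `extract_schema_version` is a hypothetical helper name, and the commented `psql` query assumes the connection details used elsewhere in this thread.

```shell
#!/bin/sh
# extract_schema_version is a hypothetical helper: it reads SQL text on
# stdin and prints the number inserted into the Version table.
extract_schema_version() {
    sed -n 's/.*INSERT INTO Version (VersionId) VALUES (\([0-9][0-9]*\)).*/\1/p' | head -n 1
}

# Example using the line shown above; against the real file one would run:
#   extract_schema_version < /usr/share/bacula-director/make_postgresql_tables
echo 'INSERT INTO Version (VersionId) VALUES (15);' | extract_schema_version

# The value stored in the live catalog could then be read for comparison
# (assumes the host, user and database names from this thread):
#   psql -h bacdb1.cent -U bacula -d bacula -At -c "SELECT VersionId FROM Version"
```

If the two numbers differ, the catalog does not match the installed Director package, which is the situation point 5 describes.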
>>>>
>>>> Ad 6: there are 3 programs that could do it automatically: Bacula
>>>> director, bacula-web (I disabled it) and a nagios check (we don't run
>>>> nagios in the test environment). I am quite sure nobody except Bacula
>>>> can do it. And yes, I am sure none of my co-workers could have messed
>>>> with the catalog either; I did ask.
>>>>
>>>>
>>>> Looking at the above, I am starting to think it may be a bug in Bacula.
>>>> Should I report it? Where?
>>>>
>>>> With regards,
>>>> Zdeněk Bělehrádek
>>>>
>>>> On 3.4.2017 at 18:51, Kern Sibbald wrote:
>>>>> Hello,
>>>>>
>>>>> The error you are getting should never happen, which means that
>>>>> something is seriously wrong with your Bacula installation.  A few of
>>>>> the multiple possibilities are:
>>>>>
>>>>> 1. Your DIR and SDs are not on the same version.  They *must* all be
>>>>> the same. With the little information you provided, for the moment
>>>>> this seems to be the most likely problem.
>>>>>
>>>>> 2. Your catalog is damaged.
>>>>>
>>>>> 3. Your Retention periods are too short and records are being removed
>>>>> from the catalog.
>>>>>
>>>>> 4. You have manually modified your catalog, so that now the records
>>>>> are not consistent.
>>>>>
>>>>> 5. Your catalog does not correspond to the Bacula Director version you
>>>>> are running.  This should be detected, but perhaps the catalog was
>>>>> later manually modified.
>>>>>
>>>>> 6. Someone manually, or some program, is removing Volume records from
>>>>> the catalog or changing them (this point is probably a duplication of
>>>>> point 4)
>>>>>
>>>>> Best regards,
>>>>>
>>>>> Kern
>>>>>
>

Bacula-users mailing list
[hidden email]
https://lists.sourceforge.net/lists/listinfo/bacula-users